BELL'S MATHEMATICAL SERIES
ADVANCED SECTION
General Editor: WILLIAM P. MILNE, M.A., D.Sc., Professor of Mathematics, University of Leeds

AN ELEMENTARY TREATISE ON DIFFERENTIAL EQUATIONS AND THEIR APPLICATIONS. By H. T. H. Piaggio, M.A., D.Sc., Professor of Mathematics, University College, Nottingham. Demy 8vo. Third Edition. 12s. net.

A FIRST COURSE IN NOMOGRAPHY. By S. Brodetsky, M.A., B.Sc., Ph.D., Professor of Applied Mathematics, Leeds University. Demy 8vo. 10s. net.

AN INTRODUCTION TO THE STUDY OF VECTOR ANALYSIS. By C. E. Weatherburn, M.A., D.Sc., Professor of Mathematics, Canterbury University College, Christchurch, N.Z. Demy 8vo. 12s. net.

ADVANCED VECTOR ANALYSIS, with Application to Mathematical Physics. By the same Author. Demy 8vo. 15s. net.

A FIRST COURSE IN STATISTICS. By D. Caradog Jones, M.A., F.S.S. Second Edition, revised. Demy 8vo. 15s. net. Also in 2 parts. Part I., 7s. 6d.; Part II., 9s.

THE ELEMENTS OF NON-EUCLIDEAN GEOMETRY. By D. M. Y. Sommerville, M.A., D.Sc., Professor of Mathematics, Victoria University College, Wellington, N.Z. Crown 8vo. 7s. 6d. net.

ANALYTICAL CONICS. By the same Author. Demy 8vo. 15s. net.

LONDON: G. BELL AND SONS, LTD.

A FIRST COURSE IN STATISTICS
BY D. CARADOG JONES, M.A., F.S.S.
LECTURER IN SOCIAL STATISTICS AT LIVERPOOL UNIVERSITY
FORMERLY LECTURER IN MATHEMATICS AT DURHAM UNIVERSITY

LONDON
G. BELL AND SONS LTD.
1924

First Published, 1921
Second Edition, revised, 1924

Printed in Great Britain by T. and A. Constable Ltd.
at the University Press, Edinburgh

PREFACE

Fifty years ago a large section of the general public were not only uninterested in what we now call the social problem, but they scarcely gave a thought to the existence of such a problem. They felt vaguely perhaps, during periods of acute distress due to lack of employment, that all was not well and they thought the Government or possibly the big landowner was to blame, but only the more enlightened realized the complexity of the body politic and how fearfully and wonderfully it is made. To-day all this is changed, and comparatively few imagine that a single panacea (the prohibition of drink, the nationalization of land, or a levy on capital) will cure all evils. The very fact that nearly the whole civilized world has given itself up for over four years to the destruction of life and the dragging down of the social fabric in all countries on so vast a scale has led to a surfeit and a reaction in which thoughtful men are eager to take part in proclaiming again a common brotherhood and in building a better world.

Those who have always been interested in this kind of architecture welcome the change of spirit, but they also recognize the difficulty of the task undertaken and the need for no little mental effort to second the good-will, which is the first essential for success. To pull down no teacher is needed, but we must learn to build.

This leads one to the subject of the present book. The man who wishes his work to stand must make sure of its foundations. He cannot afford to rest satisfied, as too often the politician and social worker do, with wild and ill-informed generalizations where more

Years      Index numbers of the eight commodities in the group    Total   Average
1867-77    100   100   100   100   100   100   100   100            800       100
1912        64    68    70    79    83    85    74   101            624        78

The figures in the last column but one are obtained by simply adding the figures in the eight previous columns, and, dividing these results by eight, we get the average index number for the group in 1912 as a percentage of that in the standard years 1867-77. Treating all the other commodities in the same way we ultimately get index numbers for all the different groups and for all commodities combined as follows:

Index Numbers for different Groups and for all Commodities.

No. of commodities in group     8     7     4    19     7     8    11    45
1867-77                       100   100   100   100   100   100   100   100
1912                           78    96    62    81   110    76    82    85

The index number for 'All Food' is obtained by summing the nineteen index numbers for the separate commodities which are included in this class and dividing the result by 19. Similarly the general index number for all commodities is obtained, not by adding the numbers for the different groups and dividing by the number of groups, but by adding the forty-five index numbers of all the separate commodities and dividing the result by 45.

In The Economist the average prices of twenty-two commodities for the years 1901-5 are taken as the standard, being denoted each by 100, and the prices of the same commodities for any other year are then written as percentages of these standard prices; the sum of these percentages is taken as the index number, and it is a simple matter to divide by 22 if we wish to get the average percentage change. The following table explains the method of calculation:

Index Numbers representing Prices of Commodities

Date               Cereals    Other    Textiles   Minerals   Miscel-    Total    Index No.
                   and Meat   Foods                          laneous             (Total ÷ 22)
1901-5                500      300       500        400        500      2200       100
End of Dec. 1916     1294      553      1124.5      824.5     1112      4908       223

MEASUREMENT, VARIABLES, FREQUENCY DISTRIBUTION

In this table five commodities are included under the head of 'Cereals and Meat,' three under 'Other Foods,' and so on. The numbers in the last column are obtained by dividing those in the previous column by 22.

It is clear that what is at bottom the same principle may be applied in any case of a variable changing with time when we wish to measure the extent of the change, so that the use of index numbers is not confined to the problem of prices. We shall return again to discuss one or two further points in connection with the same subject in the Chapter on 'Averages.'

Frequency Distribution. So far we have been thinking more particularly of the change which an individual variable, or a collection of such variables, may undergo in the course of time, or the difference between two values which the same variable may have at two different instants of time, and how to measure it. Now the science of Statistics is based upon the study of the crowd rather than of the individual, although observations on individuals have to be made before they can be combined together to produce the crowd, just as individual income-tax schedules have to be completed and combined before the balance-sheet of the State can be drawn up. As we pass from one individual to another there may be great differences in the organ or character observed (hence the word variable already introduced) but in the mass these differences are merged together and lose their individual importance: it is rather their resultant effect we seek to measure. In order therefore to discover this effect it is necessary to make a collection of individual observations and to analyse the results.
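Returning for a moment to the index numbers described earlier in this chapter, the arithmetic can be checked in a few lines. What follows is only a minimal sketch in Python, not part of the original text: the function name average_index is our own, and the figures are those quoted in the two tables above.

```python
def average_index(relatives):
    """Unweighted mean of price relatives, each expressed on a base of 100."""
    return sum(relatives) / len(relatives)

# Index numbers of the eight commodities in the first group for 1912
# (standard years 1867-77 = 100 for every commodity).
group_1912 = [64, 68, 70, 79, 83, 85, 74, 101]
group_total = sum(group_1912)            # the "last column but one"
group_index = average_index(group_1912)  # the group index number

# The Economist method: the sum of 22 percentages is itself the index
# number; dividing by 22 gives the average percentage level.
economist_1916 = [1294, 553, 1124.5, 824.5, 1112]  # five group totals
economist_total = sum(economist_1916)
economist_average = economist_total / 22

print(group_total, group_index)                     # 624 78.0
print(economist_total, round(economist_average))    # 4908.0 223
```

Note that the general index number for all commodities is the mean of the forty-five separate relatives, not the mean of the group means, so the same function would be applied to the full list of forty-five figures rather than to the group results.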
Now if our ultimate conclusions are to be safe the number of observations must be considerable, and in order to be able to cope with them and reduce them to some sort of system the first step in the analysis consists in arranging them in different classes according to the value of the variable under consideration.

It is to be noted that now we are dealing with changes in the value of a variable as we pass from one individual to another at the same period of time and under the same general conditions, and not with the change in a variable in the same individual occurring with the lapse of time. We wish, for example, to draw a distinction between (1) the change in wages as we pass from one man to another at the same time in the same trade, and (2) the change in wages of the same man, or class of men, in the same trade occurring in a given period of time; in the first case we want to find the amount of diversity within the trade at some stated time, and in the second our object is to discover whether an improvement has taken place in the wages of a particular individual or a particular trade with the passage of time.

In picturing variation of the first type the conception arises of a frequency distribution where the observations are distributed in ordered groups, with a number corresponding to each showing how many, or how frequent, are the individuals possessing the type of variable or character which defines that group. More generally, if a series of measurements or observations of a variable y are made corresponding to a selected series of another variable x we get a distribution, which becomes a frequency distribution when y represents the frequency of events happening in a particular way, or of individuals corresponding to a particular value of some common variable or character, represented by x.
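The passage from individual observations to a frequency distribution can be sketched as follows. This is an illustration only, written in Python under stated assumptions: the function frequency_table and the ten measurements are hypothetical and are not taken from the text. Each observation of x is placed in a class of the form 'lower and under lower + width', and y is the number of observations in that class.

```python
from collections import Counter

def frequency_table(values, width, start):
    """Count the observations in each class 'lower and under lower + width'.

    Returns a dict mapping the lower limit of each class to its frequency.
    """
    counts = Counter()
    for v in values:
        k = int((v - start) // width)   # index of the class containing v
        counts[start + k * width] += 1
    return dict(sorted(counts.items()))

# Hypothetical measurements of a continuous variable, in millimetres.
lengths_mm = [5.2, 5.9, 6.1, 6.4, 6.8, 7.0, 7.3, 7.7, 7.9, 8.5]

table = frequency_table(lengths_mm, width=1, start=5)
for lower, freq in table.items():
    print(f"{lower} and under {lower + 1} mm: {freq}")
```

A value falling exactly on a class limit, such as 7.0 here, is counted in the class that begins at that limit, which is one way of settling the border-line cases.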
Thus (1) the boys in a school might be grouped according to their intelligence: so many, dull; so many, of ordinary intelligence; and so many, bright or above the ordinary. Again (2) in an inquiry into the housing of the people in any town or district it would be necessary to draw up a table showing the number or frequency of existing tenements with one room, the frequency of tenements with two rooms, the frequency of tenements with three rooms, and so on. Once more (3) a zoologist, wishing to discover whether crabs of a certain species caught in one locality differ in any remarkable way from members of the same species caught in another locality, might start by making measurements of the length of carapace or upper shell for crabs of like sex in the two places and then proceed to form frequency tables for each, setting out the frequency of crabs for which the carapace length lies, say, between 5 and 6 millimetres, the frequency with length between 6 and 7 millimetres, the frequency with length between 7 and 8 millimetres, and so on. He would then have in these tables some basis for comparing the specimens caught in the two localities.

The three illustrations just used give three different types of distribution corresponding to the three types of variable to which attention has been drawn before. In the first, where the variable or character observed is not measurable, doubt will sometimes arise as to the appropriate class in which individuals should be placed who seem to be on the border line between dulness and mediocrity or between mediocrity and brilliance, so that accurate classification will greatly depend upon what is called the 'personal equation' of the observer. The second illustration corresponds
The second illustration corresponds MEASUREMENT, VARIABLES, FREQUENCY DISTRIBUTION 13 to the case where the variable changes not continuously but by unit stages ; the choice of classes in such a case depends little upon the observer unless the unit is very small compared to the total range of variability ; for example, a tenement might either definitely have two rooms or it might have three rooms, but it clearly could not be put down as having 2 J rooms or 2g- rooms : in other words, the only natural classification is so many tenements with two rooms, so many wdth three rooms, so many with four rooms, and so on, though here too some confusion might arise through failure to define clearly Avhat is ' a room.' In the third typo, where we can conceive of the continuous variation of the character under observation, there would be nothing surprising in the appearance of any value of the variable between the lowest and highest values observed ; the choice of suitable limits for the several groups becomes therefore in this case rather a delicate matter which requires careful judgment. We shall begin the next chapter with some general remarks upon the subject of classification and tabulation. CHAPTER III CLASSIFICATION AND TABULATION No part of Statistics is of more importance than that which deals with classification and tabulation, and it is the one part for which no very precise rules can be given. A neat arrangement of ideas in the mind, capacity to express them clearly, and patience are indispensable, but experience alone will convince one of the extreme care which must be exercised if blunders are to be avoided and time is to be saved in the long run. This has to be emphasized because most people, until they have tried and failed, imagine that to arrange things in classes and in tables is a straightforward proceeding involving no great thought or trouble. 
Abundant matter of a statistical character is published periodically in Blue-books, Government Reports, Reports of Local Authorities, Directors of Education, Medical Officers of Health, Chief Constables, Employers' Associations, Trade Unions, Co-operative Societies, etc., but it needs a trained intelligence as a rule to assimilate it and turn it to further advantage. The larger the scale upon which any inquiry is made, the more valuable should the results be, granted that equal accuracy is possible on the large as on the small scale, but it is fairly clear that mistakes of various kinds have also much more chance of creeping into a large work than into a small one. To appreciate the various and numerous possibilities of error when the scope is wide it is enough to read the introductions to the Registrar-General's Reports on the Census from decade to decade; this should also impress the student with the care that is necessary if he proposes to use such material for the investigation of some other problem.

It may seem a comparatively simple task to abstract two sets of figures from a Census Report, to establish a one-to-one correspondence between them, and to make deductions therefrom, but such figures when taken from their context will sometimes lead to absolutely unsafe, if not false, conclusions. The exact meaning and limitations of any data can only be properly appreciated by one who has been closely in touch with the persons who have collected them, and it is therefore important, before attempting to re-classify or re-tabulate any old statistics for a new purpose, to read very carefully through the notes made by the original compilers.
Perhaps the best advice that can be given to any one in this connection is that he should embark upon some small inquiry which will necessitate the collection of statistics for himself; the final result of his efforts may seem disappointing, but the experience he will gain will be invaluable. Ideas for such an inquiry will occur to him if he reads through some authoritative work on social questions, e.g. Beveridge's Unemployment, the decennial Census Reports, or The Minority Report on the Poor Law (1905). But he must read with an open and critical mind, questioning particularly the foundation for all statements as to cause and effect which may be made.

A few simple hints may be useful as to method of procedure. When he thinks he has discovered some subject of interest which would appear to deserve examination, it will be well to put it down on paper in order to get it clearly defined, because a precise written statement is likely to carry one further than a shadowy idea somewhere at the back of the mind which is hardly formulated at all. When the actual collection of statistics is begun it will almost certainly be found that it is impossible to solve the original problem contemplated; but that need not prevent further progress. What is important is that the limitations should be exactly realized, and this will be impossible unless the original problem is clearly presented side by side with the nearest solution obtainable.

The problem stated, the next thing is to set down categorically a number of questions, the answers to which are to be the raw material for the solution of the given problem. For the answers let us assume the inquirer is dependent upon the goodwill of others, either employers, or trade union secretaries, or public officials. The questions in that case must be clearly, concisely, and courteously phrased, and must not be capable of more than one interpretation.
In number they should be few and in character not inquisitorial; moreover, the replies should be obtainable without any great labour on the part of the persons approached. Here again it will be found that the questions first set down are not all satisfactory: one will be too vague; another, though clear enough, may involve a considerable search through a mass of other matter before it can be properly answered; while to another it might be impossible to give an exact reply in any case. Revision and amendment may therefore be necessary in the light of the first replies received, and the inquirer will begin to see at this stage how far the solution to his original problem is really possible.

When the bulk of the returns have come in they should be critically examined one by one. A number will, for one reason or another, be worthless, and they must be discarded; as for the remainder, if the questions were well chosen, the answers should not be difficult to interpret and classify; the most successful questions are those to which a simple 'yes' or 'no' in reply gives all the information required; numerical answers are less easy to deal with, especially if there is the least chance of misunderstanding on either side as there often is, for example, in the case of observations which are on the border line between two classes.

Tables should then be drawn up and the headings to the different columns of the tables should state concisely and exactly what the figures below represent. So far as possible any one should be able readily to grasp their general meaning without being obliged to wade through a page or two of written explanation; if any heading cannot be clearly expressed in a few words it may be helped out by a further note at the bottom of the page, but too many such notes are to be avoided.

Finally, a summary should be made of the various conclusions suggested by a study of the tables.
Some of the points raised in the course of the inquiry will perhaps be only incidental to the main problem under discussion, but may still deserve a passing reference. It will also be of advantage to follow up the summary by any recommendations which can be fairly based on the conclusions obtained, when the problem is such that recommendations are expedient, and, if ultimately the whole is of sufficient value to be printed, emphasis can be introduced where necessary by suitable variations in type.

For this part of the work considerable judgment is necessary which can only be acquired by long training: a faculty to pick out the real from the false and an eye to distinguish the important from the trivial. A sense of numerical proportion too is desirable incidentally; one of our leading exponents on finance in a book dealing with the meaning of money uses a very interesting illustration which is perhaps worth quoting here to show how even an acute mind may on occasion prove itself curiously lacking in such a sense. He is seeking to show how the credit system of the country is built upon a foundation composed of a little gold and a lot of paper; for this purpose he amalgamates together the balance-sheets of half a dozen big banks, and proves that their liabilities on current and deposit account amounted at a certain date prior to 1914 to 249 million pounds, while the cash in hand and at the Bank of England was 43 millions. Of the 43 millions he estimates that roughly 20 millions would be cash in the Bank of England, and further that about two-thirds of this 20 millions would be represented really by securities and not by gold. Hence he concludes that to support this vast erection of credit there would only be £6,666,666 of actual gold. Thus after talking throughout in millions the author closes by giving his result true apparently to a pound!
Much may be learnt as to methods of classification and the drawing up of tables by a careful study of those which appear in various official reports, and a few such tables are reproduced in the pages which follow.

Table (1). Condition as to Cleanliness of School Children in Surrey.

Cleanliness             5 years, 1908-12. 79,070 children inspected.
Above the average       15.4 per cent.
Average                 76.5
Below average            7.6
Much below average       0.5

Table (2). Condition as to Infectious Diseases of School Children at Different Ages in Surrey (1913).

Age groups inspected             5-6      8-9     13-14    Total at all ages
Numbers inspected              5,191    5,151    4,902     15,304

Proportion who before inspection had suffered from (per cent.):
Diphtheria                       1.3      3.5      5.4       3.4
Scarlet fever                    2.7      7.2     10.9       6.9
Measles                         55.3     79.3     84.6      72.9
Whooping cough                  41.8     56.4     54.3      50.9
German measles                   2.9      5.1      7.5       5.1
Chicken pox                     26.1     40.1     38.6      34.9
Mumps                           10.6     22.0     29.8      20.7
No infectious diseases          18.9      6.1      4.7      10.0
No definite information          3.3      2.2      0.9       2.2

Table (3). Height of School Children according to District, Age, and Sex (1913).

                         Boys                                        Girls
Age       Nos.       Average     Average height in cm.    Nos.       Average     Average height in cm.
groups    measured   height in   Surrey    England and    measured   height in   Surrey    England and
                     inches                Wales                     inches                Wales
5-6         2724       41.4      105.2      103.4           2467       41.3      104.9      102.6
8-9         2578       47.8      121.4      120.4           2573       47.5      120.7      119.4
13-14       2529       57.0      144.8      142.4           2433       57.9      147.1      144.2

The first four are taken from the Annual Report of the School Medical Officer for the County of Surrey, 1913. The first is an example of single tabulation showing the distribution according to cleanliness of children inspected in the elementary schools. The second is an example of double tabulation, showing the distribution according to age of school children who at some period before the date of inspection had suffered from certain infectious diseases.
The third is an example of quadruple tabulation, showing the distribution of school children according to height, district, sex, and age. Thus in the first case we have one factor brought into relief, viz. cleanliness; in the second case we have two factors, age and disease; in the third case we have four factors, height, district, sex, and age.

When we have two or more factors tabulated together as in cases (2) and (3), we may be sometimes led to discover a connection of some kind, possibly causal, between them, and the search for such a connection, or correlation as it is called, represents one very useful purpose to which tabulation may be put. Table (4) is an illustration of this. It is the result of certain measurements carried out in order to discover the effect of employment out of school hours upon the physical condition of boys. The particular factor examined as the possible cause of evil in this connection is lack of sleep, and the figures given certainly seem to warrant a closer examination into the matter.

Table (4). Physical Condition of certain Boys according to Hours of Sleep Obtained.

No. of hours     No. of boys   Average     Average     Nutrition (percentage)
sleep obtained   examined      height in   weight in   Above      Average    Below
                               inches      lbs.        average               average
7 to 8               14          54.5        71.3        7.1       35.8       57.1
8 to 9               80          55.4        73.9       10.1       65.9       24.0
9 to 10             296          56.4        79.3       15.3       64.5       20.2
10 to 11            280          57.9        83.2       22.8       66.5       10.7
11 to 12             50          59.0        87.0       22.0       68.0       10.0

Tables (5) and (6) are two illustrations of neat tables, containing a large amount of information in a small space, set out in such a form that the eye can easily take it in, and that is the main purpose of tabulation. These examples are selected from the Sixteenth Abstract of Labour Statistics of the United Kingdom, Cd. 7131.
In Table (6) note the classification of age groups: it is not '5 to 10 years,' '10 to 15 years,' and so on, but '5 and under 10 years,' '10 and under 15 years,' and so on. This removes difficulties at the border lines between two classes; the difficulties are not completely removed, however, unless there is some understanding as to what shall constitute under any particular age. Shall it be six months under, or one day under, or one hour under? This sort of ambiguity has more importance in some cases than in others. Suppose, for example, we were classifying men according to their height: a group of the type '60 inches and under 62 inches,' assuming that measurements were made to the nearest half-inch, would really include all men who were '59¾ inches and under 61¾ inches'; because one who measured anything from 59¾ in. to 60¼ in., being nearer to 60 in. than to 59½ in. measuring to the nearest half-inch, would be registered as 60 in. in height, while one who measured anything from 61¾ in. to 62¼ in., being nearer to 62 in. than to 61½ in., would be registered as 62 in. in height.

Another point to be noted is that in general people making returns seem to have a psychological weakness for round figures, so that a man in the neighbourhood of 40 years of age, for example, is apt to record himself as actually 40 although he may really be 39 or 41 years old. To diminish the error arising from this fact it is usual, when not otherwise inconvenient, to fix the centres of the class-intervals at round figures: e.g. to take '15 and under 25 years,' '25 and under 35 years,' etc., in preference to '20 and under 30 years,' '30 and under 40 years,' etc. Where there is any known bias in the data, as, for instance, in the familiar case of certain women who consistently register themselves as younger than they really are, a correction can be made in the final figures.

Table (5). Classification of Overcrowded Tenements,* England and Wales (1911).

                    Urban Districts                      Rural Districts                      Total
Tenements      No. of over-   Occupants thereof     No. of over-   Occupants thereof     No. of over-   Occupants thereof
with           crowded        No.        Per cent.  crowded        No.        Per cent.  crowded        No.          Per cent.
               tenements                 of total   tenements                 of total   tenements                   of total
                                         popn.                                popn.                                  popn.
1 room            56,290      206,022      0.7         1,545         5,748      0.1         57,835        211,770      0.6
2 rooms          119,695      712,613      2.5        15,397        91,458      1.2        135,092        804,071      2.2
3 rooms          107,892      847,937      3.0        22,380       175,988      2.2        130,272      1,023,925      2.8
4 rooms           64,470      624,747      2.2        17,341       167,969      2.1         81,811        792,716      2.2
5 or more
  rooms           21,200      251,405      0.9         4,700        55,585      0.7         25,900        306,990      0.8

* For the purpose of the Census Reports 'ordinary tenements which have more than two occupants per room, bedrooms and sitting-rooms included,' are considered overcrowded.

Table (6). Population grouped according to Age, England and Wales (1911). Males.

                          Urban Districts          Rural Districts          All Districts
Age groups                Number      Per cent.    Number      Per cent.    Number      Per cent.
Under 5 years            1,517,432     11.3         418,681     10.6       1,936,113     11.1
5 and under 10 years     1,431,900     10.6         415,395     10.5       1,847,295     10.6
10 and under 15 years    1,341,586      9.9         406,045     10.3       1,747,631     10.0
15 and under 20 years    1,267,500      9.4         387,395      9.8       1,654,895      9.5
20 and under 30 years    2,332,135     17.3         626,300     15.9       2,958,435     17.0
30 and under 40 years    2,094,934     15.5         542,370     13.7       2,637,304     15.1
40 and under 50 years    1,556,818     11.6         444,360     11.3       2,001,178     11.5
50 and under 60 years    1,042,868      7.7         333,368      8.4       1,376,236      7.9
60 and under 70 years      612,741      4.5         230,306      5.8         843,047      4.8
70 and upwards             296,246      2.2         147,228      3.7         443,474      2.5
Total                   13,494,160    100.0       3,951,448    100.0      17,445,608    100.0

Percentages for the combined groups:
Under 20 years                         41.2                     41.2                     41.2
20 and under 50 years                  44.4                     40.9                     43.6
50 and upwards                         14.4                     17.9                     15.2
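The effect of rounding upon the real limits of a class, as in the example of heights recorded to the nearest half-inch, can be worked out mechanically. The sketch below is an illustration in Python and not part of the original text; the function true_limits is a hypothetical name, and it assumes that every measurement is recorded to the nearest multiple of the stated unit.

```python
from fractions import Fraction

def true_limits(lower, upper, unit):
    """Real range covered by the class 'lower and under upper'.

    Assumes each measurement is recorded to the nearest multiple of
    `unit`, so every recorded value stands for half a unit either side.
    """
    half = Fraction(unit) / 2
    return Fraction(lower) - half, Fraction(upper) - half

# Heights recorded to the nearest half-inch, class '60 and under 62 inches'.
low, high = true_limits(60, 62, Fraction(1, 2))
print(low, high)   # 239/4 247/4, i.e. 59 3/4 in. and 61 3/4 in.
```

Exact fractions are used rather than decimals so that quarter-inch limits are represented without rounding error.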
In any frequency distribution where we wish to group a number of observations according to the magnitude of some common variable, as in Table (6) a number of males grouped according to age, the question arises: 'How many groups should there be?' With this question is involved also the size of the corresponding class-interval, and this should be so large that, with possible exceptions at either extremity of the table, there are a fair proportion of observations to each class or group; and, contrariwise, it should be so small that all the observations in any one group may be treated practically as if they were located at the centre of the group so far as the variable in question is concerned, e.g. it should be possible to treat males recorded in class '50 and under 60 years,' where the interval is 10 years, as if they were all of age 55 years. It will be found in general that a number of groups somewhere in the neighbourhood of 20 is the most satisfactory, granted that the number of observations is reasonably large, although in some cases it is impossible to split up the unit of class-interval, and we are obliged to be satisfied with a smaller number of groups on this account: Table (5) is a case in point where we are tied down to one room as the class-interval. In Table (6) the class-interval varies, being only 5 years at first, and afterwards 10 years, but as a rule the labour of calculation of the different statistical constants we require is considerably simplified if it is possible to keep the size of the class-interval the same for each group.

CHAPTER IV

AVERAGES

Common Average or Arithmetic Mean. Let us consider one of the commonest meanings of the term average. If a train travels a distance of 180 miles in 3 hours we say that it has been moving at 60 miles an hour.
By this we do not mean that its speed is always 60 m/h, never more, never less, but that if it had moved always at that uniform speed it would have accomplished its journey in exactly the same time. As a matter of fact, during some instants it may have been moving at a much slower rate than 60 m/h, but, if so, it must have made up for this slackness by travelling at a much faster rate than 60 m/h during other instants, so that on the whole a balance was effected, and, as we say, the speed averaged out at 60 m/h.

Again, suppose the wages of three men are: A, 27s. a week; B, 18s. a week; C, 30s. a week. We should say that the average wage of the three was equivalent to ⅓(27+18+30)s. = 25s. a week. In other words, if A, B, and C were all under the same employer, and if, instead of paying them different amounts, he wanted to pay them all equally, he would have to give each man 25s. a week, assuming that his total wages bill was to remain unaltered. This method of measurement gives what is known as the arithmetic mean, or, more simply, the mean.

Once more, in discussing the state of the labour market as regards different trades, when we wish to compare one with another, it is not the actual numbers unemployed in each trade that are quoted, but these numbers expressed as percentages of the total numbers employable in each trade.

In each of these three cases we reduce our observations or measurements to a sort of common denominator, so that they may be mentally compared or contrasted more readily with other observations of a similar character. Thus we have in mind a certain mean train speed per hour, or mean wage per week, or mean percentage out of work, as the case may be.

An average then in general we may regard as one of a class of statistical constants (others of which we are to meet later) which concisely label a set of observations or measurements pertaining to a common family.
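The two numerical examples of the arithmetic mean given above can be verified directly. The following lines are a minimal sketch in Python, not part of the original text; the function name arithmetic_mean is our own choice.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by their number."""
    return sum(values) / len(values)

# Weekly wages in shillings of the three men A, B and C.
wages = {"A": 27, "B": 18, "C": 30}
mean_wage = arithmetic_mean(list(wages.values()))

# Paying every man the mean leaves the total wages bill unaltered.
bill_unchanged = mean_wage * len(wages) == sum(wages.values())

# The train: 180 miles in 3 hours.
mean_speed = 180 / 3

print(mean_wage, bill_unchanged, mean_speed)   # 25.0 True 60.0
```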
Such an average is designed to describe the family type more nearly than is possible by observing any chance member, and in value it should therefore come somewhere near the middle of the family group, so that if the individual members of the family chance to be equal each to each in respect to the organ or character observed it should have the same value as they have. This constitutes a test for the validity of any formula giving the average of a set of observations: e.g. we might, if we wish, define the average of three numbers p, q, r to be, not ⅓(p+q+r), but another formula, for (1) this formula, too, can be shown to give a number intermediate in value between the greatest and least of the numbers p, q, r; also (2) if we put p=q=r=k (say), the formula reduces to k.

Clearly the range of choice for the definition of an average is infinite, though only a few definitions give averages which have proved their utility and come into general use. Of these the most important is the common mean already introduced, with its extension, the weighted mean, but at least two others deserve special consideration, the median and the mode.

Median. In any observed distribution if all the individuals can be arranged in order of magnitude of the character or organ observed, which may be conveniently done when they are not very numerous, the median organ or character will be that pertaining to the individual half-way along the series, so that there are in general an equal number of individuals above and below the median. For instance, if seven boys of different heights be placed to stand in a row, the tallest first, the next tallest next, and so on, the median height is the height of the fourth boy from either end. If there are an even number of boys, say eight, it would be natural to take as median the height midway between that of the fourth and that of the fifth boy.
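The rule for the median of a short series, with its two cases of an odd and an even number of individuals, may be sketched as below. This is an illustration in Python under stated assumptions and not part of the original text: the function name median and the heights, given in inches, are hypothetical.

```python
def median(values):
    """Median of a small set of observations.

    The observations are arranged in order of magnitude; with an odd
    number the middle one is taken, with an even number the value
    midway between the two middle ones.
    """
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

# Hypothetical heights in inches, not taken from the text.
seven_boys = [58, 50, 55, 53, 60, 52, 56]
eight_boys = seven_boys + [62]

print(median(seven_boys))   # 55, the fourth boy from either end
print(median(eight_boys))   # 55.5, midway between the fourth and fifth
```

It makes no difference whether the boys are ranged tallest first or shortest first, since the middle of the row is the same counted from either end.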
When the items are numerous they are frequently grouped into classes, as we have seen, such that all in the same class are reckoned to have some value lying between the extreme limits of that class. We should then, as before, halve the total number of observations to fix the particular individual which defines the median organ or character. This would enable us to pick out the group in which the median lies, and on reference to the original record of observations, assuming it was at hand, it would be a simple matter to identify the median.

If the original record be not available, however, it will be necessary to proceed to get the best value we can for the median in some other way. Consider, for example, Table (7), showing the distribution of marks obtained by 514 candidates in a certain examination. We begin by rearranging the data in the manner shown below Table (7).

Now in accordance with the definition the median in marks should, strictly speaking, be midway between the marks assigned to the 257th candidate and the marks assigned to the 258th candidate: in fact, the marks corresponding to candidate number 257.5, if it were possible for such a candidate to exist. But we are ignorant so far as Table (7) goes of the marks gained by either the 257th or the 258th candidate, though it is possible, by the simple proportional process known as 'interpolation,' to calculate approximately the marks we require.

We think of all the candidates as forming an ordered sequence, ranged one after the other according to their marks just like the boys of different heights, and the table shows that in this mental picture the 231st candidate gets approximately 30 marks, while the 318th candidate gets approximately 35 marks. Hence candidate number 257.5, if one existed, ought to get a number of marks somewhere between 30 and 35.
But, in this neighbourhood of the sequence, a difference of $(318-231)$ candidates corresponds to a difference of 5 marks, therefore a difference of $(257.5-231)$ candidates corresponds to a difference of $(\frac{5}{87}\times 26.5)$ marks. Thus the marks obtained by candidate number 257.5 are approximately

$$30+\tfrac{5}{87}\times 26.5 = 31.523,$$

and this may be taken as the median.

On examining the actual marks-sheet it was found that 252 candidates obtained 31 marks or less, and 273 candidates obtained 32 marks or less, so that the real median was 32, because this was the number of marks gained by both the 257th and the 258th candidates. The number 31.523 found above, however, would be a good approximation to take for the median when all the information at our disposal was that shown in Table (7).

Table (7). Marks obtained by 514 Candidates in a certain Examination.

| Marks Obtained | No. of Candidates |
|---|---|
| 1 to 5 | 5 |
| 6 to 10 | 9 |
| 11 to 15 | 28 |
| 16 to 20 | 49 |
| 21 to 25 | 58 |
| 26 to 30 | 82 |
| 31 to 35 | 87 |
| 36 to 40 | 79 |
| 41 to 45 | 50 |
| 46 to 50 | 37 |
| 51 to 55 | 21 |
| 56 to 60 | 6 |
| 61 to 65 | 3 |
| Total | 514 |

The table is to be read as follows: 5 candidates obtained 1, 2, 3, 4, or 5 marks, 9 candidates obtained 6, 7, 8, 9, or 10 marks, and so on. By straightforward addition it can evidently be rearranged so as to read thus:

| No. of Candidates | Obtained not more than (marks) |
|---|---|
| 5 | 5 |
| 14 | 10 |
| 42 | 15 |
| 91 | 20 |
| 149 | 25 |
| 231 | 30 |
| 318 | 35 |
| 397 | 40 |
| 447 | 45 |
| 484 | 50 |
| 505 | 55 |
| 511 | 60 |
| 514 | 65 |

It will be noted that in calculating the median no use is made of the marks of any of the candidates except those in the two groups in the immediate neighbourhood of the median, and it is one of the great advantages of this average that it can be found when an exact knowledge of the characters of the more extreme individuals in the series is not in our possession, and even when their measurement is impossible: it is enough if they can be roughly located.
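The interpolation just described may be set down as a short routine. This is only a sketch under the convention of the worked example, namely that the candidate whose rank closes a group is credited with that group's upper limit; the function name and arguments are our own.

```python
def grouped_median(upper_limits, frequencies):
    """Median of a grouped distribution by proportional interpolation.

    upper_limits: highest mark in each group, in ascending order.
    frequencies:  number of candidates in each group.
    """
    total = sum(frequencies)
    target = (total + 1) / 2          # notional middle candidate, e.g. 257.5
    # The mark credited to the candidate just below the first group.
    lower_limit = upper_limits[0] - (upper_limits[1] - upper_limits[0])
    cumulative = 0
    for upper, freq in zip(upper_limits, frequencies):
        if cumulative + freq >= target:
            # Share the group's width in proportion to the rank required.
            return lower_limit + (upper - lower_limit) * (target - cumulative) / freq
        cumulative += freq
        lower_limit = upper
    raise ValueError("no observations supplied")


# Table (7): marks of 514 candidates in groups 1 to 5, 6 to 10, ..., 61 to 65.
uppers = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
counts = [5, 9, 28, 49, 58, 82, 87, 79, 50, 37, 21, 6, 3]

print(round(grouped_median(uppers, counts), 3))   # 31.523
```

Only the cumulative count up to 30 marks (231) and the frequency of the group 31 to 35 (87) enter the result, which bears out the remark that the extreme groups need only be roughly located.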
The arithmetic mean on the other hand is often unduly influenced by abnormal individuals which are not really typical of the population in which they appear.

Mode. If we measure or observe some organ or character for each individual in a given population, the mode, as its name suggests, is simply the organ or character of most fashionable or most frequent size. A large draper, for example, will have collars of several different shapes and sizes in his shop, but the fashionable shape and the predominant size correspond to the mode: it is the mode that sells most readily, and the intelligent draper will always have it in stock. Again, in Table (2), the disease mode or fashionable disease among certain school children inspected in Surrey in 1913 was measles, for a greater percentage of children had suffered from measles than from any other of the diseases recorded.

Now when the variable in which we are interested is 'discrete,' that is, when it changes by unit steps, leading to classes like 'tenements with 1 room,' 'tenements with 2 rooms,' 'tenements with 3 rooms,' and so on, it is an easy matter to pick out the class of greatest frequency: thus, in Table (5) there are more overcrowded tenements with 2 rooms than with any other number of rooms in the urban districts, so that 2 is the mode so far as this character (number of rooms) is concerned, whereas in the rural districts 3 is the mode, for there are more overcrowded tenements with 3 rooms than with any other number. There may be ambiguity, however, in determining the mode in this way for a grouped frequency distribution when we are dealing with an organ or character subject to 'continuous variation.' To cover such cases the modal value has been defined as that value for which the frequency per unit variation of the organ or character is a maximum.
The precise significance of this wording will only be appreciated after discussing frequency curves: at present it must suffice to give a practical illustration of how the ambiguity arises and calls for some more refined treatment.

For this purpose turn again to the examination marks in Table (7), from which it appears that the mode, if it is to be the marks obtained by the greatest number of candidates, should lie in the group (31 to 35), since there are 87 candidates with marks between these limits, and this number exceeds that in any other group. But how are we to decide the exact point in the interval (31 to 35) which is to correspond to the mode? Shall it be 33? We might say 'yes' if the distribution were perfectly symmetrical on either side of the (31 to 35) group, but if we examine the neighbouring groups we see that the balance leans rather more heavily to the (26 to 30) group with a frequency of 82 than to the (36 to 40) group with a frequency of 79, and we might allow for this by interpolating in some way, ignoring, of course, any errors which may occur in the frequencies themselves owing to the observations being generally limited in number. But the pull in the direction of lower marks becomes still more pronounced to our minds when we contrast also the frequencies in the next groups on either side, namely 58 and 50. So we might go on until the influence of the whole field of observations comes into action.

Now it so happened that in this particular case the original marks-sheet was to be seen, and a regrouping of the candidates as in Table (8) makes it clear that the value found in this way for the mode may be artificially displaced sometimes to a serious extent by the particular method of grouping adopted. Thus, according to this new arrangement, the mode would seem to lie in the interval (28 to 32), the mid-value of which differs materially from 33, the mid-value of the previous maximum frequency group.

Table (8).
Marks obtained by 514 Candidates in a certain Examination (Alternative Grouping).

| Marks Obtained | No. of Candidates |
|---|---|
| 3 to 7 | 10 |
| 8 to 12 | 17 |
| 13 to 17 | 35 |
| 18 to 22 | 56 |
| 23 to 27 | 47 |
| 28 to 32 | 108 |
| 33 to 37 | 74 |
| 38 to 42 | 73 |
| 43 to 47 | 45 |
| 48 to 52 | 31 |
| 53 to 57 | 12 |
| 58 to 62 | 3 |
| 63 to 67 | 3 |
| Total | 514 |

[It should be observed that while an alteration of the grouping may also affect the median, it does not affect it nearly to the same extent: e.g. the median determined from Table (8) is 31.3, which differs little from 31.5, the value obtained by the first grouping.]

If, again, we combine the results of our two groupings to find the mode we might be tempted to conclude that it lies somewhere between the limits 31 and 32, but on examining the original records it was discovered that the real mode was 28. The frequency distribution of candidates in this neighbourhood was in fact very interesting; it ran as follows:

| Marks obtained | No. of candidates |
|---|---|
| 25 | 14 |
| 26 | 10 |
| 27 | 6 |
| 28 | 33 |
| 29 | 17 |
| 30 | 16 |

The explanation of this peculiar distribution seemed to be that 28 marks were required for a candidate to pass, and apparently as many candidates as possible were pushed over the pass line: if, on the first marking, a candidate was found to want only one mark to pass, the examiner presumably looked through his paper again and did his best to find an answer which by kindly treatment might be granted an extra mark. The effect of this leniency was ultimately to leave only 6 candidates in the division immediately below the pass line, and to swell the number immediately above to 33, which thus made 28 easily the 'most fashionable' mark of any, the next largest group of candidates being only 21.
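The effect of regrouping on the mode and on the median can be checked with a few lines. The sketch below takes the two groupings of the 514 candidates as given in Tables (7) and (8); the function names are our own, and the median is interpolated on the same convention as in the worked example above.

```python
# Each key is (lowest mark, highest mark) of a group; each value its frequency.
TABLE_7 = {(1, 5): 5, (6, 10): 9, (11, 15): 28, (16, 20): 49, (21, 25): 58,
           (26, 30): 82, (31, 35): 87, (36, 40): 79, (41, 45): 50,
           (46, 50): 37, (51, 55): 21, (56, 60): 6, (61, 65): 3}
TABLE_8 = {(3, 7): 10, (8, 12): 17, (13, 17): 35, (18, 22): 56, (23, 27): 47,
           (28, 32): 108, (33, 37): 74, (38, 42): 73, (43, 47): 45,
           (48, 52): 31, (53, 57): 12, (58, 62): 3, (63, 67): 3}


def modal_group(table):
    """Group holding the greatest number of candidates."""
    return max(table, key=table.get)


def mid_value(group):
    low, high = group
    return (low + high) / 2


def interpolated_median(table):
    """Proportional interpolation inside the group containing the middle rank."""
    target = (sum(table.values()) + 1) / 2
    cumulative = 0
    for (low, high), freq in table.items():
        if cumulative + freq >= target:
            start = low - 1   # mark credited to the last candidate of the previous group
            return start + (high - start) * (target - cumulative) / freq
        cumulative += freq
    raise ValueError("no observations supplied")


for name, table in (("Table (7)", TABLE_7), ("Table (8)", TABLE_8)):
    group = modal_group(table)
    print(name, group, mid_value(group), round(interpolated_median(table), 1))
```

The modal mid-value shifts from 33 to 30 between the two groupings, while the interpolated median moves only from about 31.5 to about 31.3; neither grouping by itself reveals the true mode of 28 found on the marks-sheet.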
It will be observed that even a candidate who wanted 2 marks to pass was treated in the same tolerant fashion, although it is not so easy, of course, for a conscientious examiner to discover two extra marks as it is to discover one; and if the candidate is 3 marks below the pass line it is still harder to give him the necessary lift to carry him over. Thus in the final list we find more candidates with 26 marks than with 27, and still more with 25 than with 26. If the above diagnosis is correct, and all marks-sheets tell the same tale, who shall again say that examiners do not temper justice with mercy?

This example has illustrated fairly clearly the difficulty of fixing the mode with any great precision by mere inspection when the individuals are arranged in groups, the value of the variable under discussion lying between prescribed limits for each group. While it is possible to get a rough approximation to its value in this way, we conclude that for a really satisfactory determination we require some method which makes use of the whole distribution, as in the determination of the mean, and not merely of the portion in the supposed neighbourhood of the mode. This must be left to a later chapter; we shall only point out before passing on that there may sometimes be more than one mode in a given frequency distribution just as there may be more than one fashionable type of collar which it is expedient for the draper to stock in large quantities. The second grouping in the examination example suggests such a possibility, for it will be noticed that the frequencies of candidates do not rise steadily to a single maximum at 108 for class (28 to 32), and then fall steadily: there is a previous rise and fall in the neighbourhood of class (18 to 22).

Weighted Mean. Let us suppose a farmer employs for the harvest 5 men, 3 women, and 4 boys.
In estimating the amount of work they can do in a given time it is clear that in general a woman or boy cannot be reckoned as equal to a man. He must therefore decide what 'weight' must be given to each in proportion to a man. If a woman's work be taken, for example, to be three-quarters as effective and a boy's work to be half as effective as that of a man, we have as the appropriate proportional weights $1 : \frac{3}{4} : \frac{1}{2}$ or $4 : 3 : 2$. Hence 5 men, 3 women, and 4 boys would on the average be equivalent in output to

$$\left(5+3\times\tfrac{3}{4}+4\times\tfrac{1}{2}\right)\text{ men}=\frac{4\times5+3\times3+2\times4}{4}\text{ men}=9\tfrac{1}{4}\text{ men}.$$

An average of this type is called a weighted mean, $1$, $\frac{3}{4}$, and $\frac{1}{2}$ being the weights, because they tell us what weight to give to each separate worker in calculating the average.

Let us consider the effect such weighting has in general upon a mean, and for this purpose we shall test it on a set of index numbers measuring rents in certain groups of towns in 1912, as given in a Report on the Cost of Living of the Working Classes issued by the Board of Trade (Cd. 6955).

Table (9). Mean Index Numbers of Rents for certain Geographical Groups of Towns in 1912 (with reference to Middle Zone of London as standard = 100).

| (1) Geographical Group | (2) Rents | (3) No. of Towns included in the Group | (4) Each Group counting as 1 | (5) Arbitrary Weights | (6) Approximate sub-multiples of Nos. in previous column |
|---|---|---|---|---|---|
| Northern Counties and Cleveland | 66.0 | 9 | 1 | 27 | 3 |
| Yorkshire (except Cleveland) | 58.5 | 10 | 1 | 54 | 6 |
| Lancashire and Cheshire | 56.9 | 17 | 1 | 45 | 5 |
| Midlands | 52.3 | 14 | 1 | 125 | 14 |
| Eastern and East Midland Cos. | 53.4 | 7 | 1 | 63 | 7 |
| Southern Counties | 63.7 | 10 | 1 | 14 | 2 |
| Wales and Monmouth | 64.8 | 4 | 1 | 22 | 2 |
| Scotland | 62.0 | 10 | 1 | 178 | 20 |
| Ireland | 51.7 | 6 | 1 | 55 | 6 |
| Average | .. | 58.4 | 58.8 | 57.6 | 57.6 |

The first mean in the above table, 58.4, is obtained by multiplying (or weighting) the mean rent of each geographical group by the number of towns in the group, given in col.
(3), adding the numbers so obtained, and dividing the total by the total number of towns, thus:

$$\frac{9(66.0)+10(58.5)+\dots+6(51.7)}{9+10+\dots+6}.$$

This is simply the arithmetic mean treating each town as unit.

The second mean, 58.8, is obtained by adding the mean rents of all the groups and dividing by the total number of groups, thus:

$$\frac{66.0+58.5+\dots+51.7}{1+1+\dots+1}.$$

This is the arithmetic mean treating each geographical group as unit.

The third mean, 57.6, is obtained by multiplying, or weighting, the mean rent of each group by a perfectly arbitrary number given in col. (5); the numbers selected were taken quite at random from another column of figures in another Blue-book, and had no connection whatever with the subject of rents; this gives:

$$\frac{27(66.0)+54(58.5)+\dots+55(51.7)}{27+54+\dots+55}.$$

The last mean, 57.6, is obtained by choosing as weights any numbers (and for simplicity we choose the smallest) as in col. (6) which are very roughly proportional to the arbitrary weights used in the last instance; we thus get:

$$\frac{3(66.0)+6(58.5)+\dots+6(51.7)}{3+6+\dots+6}.$$

Now the first of these means is clearly the most satisfactory, since it is the result of very properly weighting the mean rent of each group of towns according to the number of towns the group contains. But the second result shows that if we are ignorant of the number of the towns in each group we shall not be very far out in our calculation if we treat them all as of equal importance, and find the simple arithmetic mean of the mean rents in the nine groups. We can even go further, for we find, from the third and fourth results, that by weighting the mean rents in the various groups on quite a random basis, the mean we get still does not differ very greatly from the best value first found.
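The four means of Table (9), and the farmer's reckoning in man-equivalents, may be verified by a short computation. The sketch below uses the figures of the table as read from the text; the small weight 5 for Lancashire and Cheshire in col. (6) is an inference from the neighbouring ratios (about one-ninth of col. (5)), and the function name is our own.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Table (9): mean rent index of each of the nine geographical groups.
rents = [66.0, 58.5, 56.9, 52.3, 53.4, 63.7, 64.8, 62.0, 51.7]

towns = [9, 10, 17, 14, 7, 10, 4, 10, 6]             # col. (3)
equal = [1] * len(rents)                              # col. (4)
arbitrary = [27, 54, 45, 125, 63, 14, 22, 178, 55]    # col. (5)
reduced = [3, 6, 5, 14, 7, 2, 2, 20, 6]               # col. (6); the 5 is inferred

for label, weights in (("towns", towns), ("equal", equal),
                       ("arbitrary", arbitrary), ("reduced", reduced)):
    print(label, round(weighted_mean(rents, weights), 1))

# The harvest example: 5 men, 3 women and 4 boys with weights 1, 3/4 and 1/2.
man_equivalents = 5 * 1 + 3 * 0.75 + 4 * 0.5
print(man_equivalents)
```

The four results agree with the bottom row of the table to one decimal place, the extreme values differing by less than one unit in spite of the very different systems of weights.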
The important principle of which the above example is an illustration is perfectly general, and may be stated as follows: If the total number of measurements or observations be not very small, and if the resulting values of the organ or character measured (rent in our case) be not very unequal, any reasonable selection of multipliers or weights (as, for instance, the first two adopted above) will give means which differ from one another by but little; and even an apparently unreasonable selection of multipliers (as, for instance, the third adopted above), assuming they are not so wildly chosen as to give any particular group a very unfair weight in comparison with the others, will not throw the mean out badly. Further, in place of a set of large multipliers we may substitute small numbers which are roughly proportional to them (as we have done in the fourth case above), and the mean will again be very little affected. [See Appendix, Note 2.]

CHAPTER V

AVERAGES (continued)

Applications of Weighted Mean. In determining the weighted mean of a set of observations it is usual, of course, to weight each observation according to its importance, though what number should be chosen as a measure of its importance may sometimes be a matter of doubt. It is not a very difficult matter to decide when we wish, for example, to compare birth, marriage, or death rates in two districts, if we know how the constitution of the population in the one district differs from that in the other, for the weighting in each of these cases must be in proportion to the population concerned, and it is too important to ignore.

Death rate, crude and corrected. Imagine a city in which the total number of deaths in a certain year is N out of a population numbering P. The ordinary or crude death rate for that city will then be $\frac{N}{P}\times 1000$, by definition.
Now this number N may be analysed according to the ages of the people who have died; let us suppose it is made up of

$n_1$ people between limits 0 and less than 5 years of age,
$n_2$ people between limits 5 and less than 15 years of age,
$n_3$ people between limits 15 and less than 25 years of age,

and so on, where $n_1+n_2+n_3+\dots=N$.

Again the number P may be analysed according to the ages of the people who compose the total population, giving, say,

$p_1$ of the population between limits 0 and less than 5 years of age,
$p_2$ of the population between limits 5 and less than 15 years of age,
$p_3$ of the population between limits 15 and less than 25 years of age,

and so on, where $p_1+p_2+p_3+\dots=P$.

Thus we may write for the crude death rate

$$D=\frac{N}{P}\times 1000=\frac{n_1+n_2+n_3+\dots}{P}\times 1000=\frac{n_1}{P}1000+\frac{n_2}{P}1000+\frac{n_3}{P}1000+\dots$$

$$=\frac{p_1}{P}\left(\frac{n_1}{p_1}1000\right)+\frac{p_2}{P}\left(\frac{n_2}{p_2}1000\right)+\frac{p_3}{P}\left(\frac{n_3}{p_3}1000\right)+\dots=(p_1d_1+p_2d_2+p_3d_3+\dots)/P,$$

where $d_1$ is the death rate between limits 0 and less than 5 years of age, $d_2$ is the death rate between limits 5 and less than 15 years of age, and so on.

Now if we compare this expression with the corresponding one for another city, say,

$$D'=(p_1'd_1'+p_2'd_2'+p_3'd_3'+\dots)/P',$$

it is quite conceivable that the death rates in the various age groups might be equal,

$$d_1=d_1',\quad d_2=d_2',\quad d_3=d_3',\ \dots$$

and yet D might exceed D' because in the first city there are a greater proportion of infants or old people, on which classes the hand of death falls heaviest, that is, because the p's or weights which multiply the biggest d's are greater in the first case than in the second. But so long as the d's in the two cities are equal, age group by age group, it would be reasonable to regard the cities as equally healthy, or unhealthy as the case might be, and therefore to insure a fair comparison it is usual in the Reports of the Registrar-General to give a corrected death rate in place of the crude death rate defined above.

This is done by weighting the death rate for each age group, not in proportion to the actual number of persons in that group in the city itself, but in proportion to the corresponding number in the country at large.
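A small numerical sketch may make the point plainer. All the figures below are hypothetical, chosen only so that the two cities have equal death rates in every age group but different age constitutions; the age groups, function name, and country population are likewise assumptions made for illustration.

```python
def weighted_death_rate(populations, rates_per_1000):
    """Age-group death rates weighted by the numbers in each age group:
    (p1*d1 + p2*d2 + ...) / (p1 + p2 + ...)."""
    weighted_sum = sum(p * d for p, d in zip(populations, rates_per_1000))
    return weighted_sum / sum(populations)


# Hypothetical death rates per 1000 in five age groups, the same in both cities.
rates_first_city = [40.0, 4.0, 5.0, 12.0, 70.0]
rates_second_city = [40.0, 4.0, 5.0, 12.0, 70.0]

# Hypothetical populations in the same age groups.
first_city = [12000, 18000, 20000, 35000, 15000]    # more infants and old people
second_city = [8000, 20000, 24000, 40000, 8000]
country = [1100000, 1900000, 2100000, 3900000, 1000000]

# Crude rates: each city's own population supplies the weights.
crude_first = weighted_death_rate(first_city, rates_first_city)
crude_second = weighted_death_rate(second_city, rates_second_city)

# Corrected rates: the country's population supplies the weights for both.
corrected_first = weighted_death_rate(country, rates_first_city)
corrected_second = weighted_death_rate(country, rates_second_city)

print(crude_first, crude_second)
print(corrected_first, corrected_second)
```

With these assumed figures the crude rate of the first city is the higher, solely on account of its age constitution, whereas the two corrected rates coincide, as they must when the rates agree age group by age group.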
Thus, if we denote the proportion of the population, Q, in the country at large between limits 0 and less than 5 by $q_1/Q$, between limits 5 and less than 15 by $q_2/Q$, between limits 15 and less than 25 by $q_3/Q$, and so on, we get as the corrected death rate

$$(q_1d_1+q_2d_2+q_3d_3+\dots)/Q,$$

a form which has the effect of making the results agree in two cities which have equal d's throughout.

A similar method of correction is clearly applicable in considering the incidence of the death rate when we are concerned not with a difference of district but with a difference of sex, occupation, religious profession, wage-earning capacity, or any other well-defined character. Further, it may be used also in comparing birth rates, marriage rates, heights, weights, chest measurements, or any similar attributes, when it is necessary to refer the observations or measurements to a standard population in order to avoid complications due to age variation.

There is another method of correction, equally general in application, which is useful when the death rates in the various age groups are not known. In this case D, the crude death rate for the whole population of the district, is known, also $p_1/P,\ p_2/P,\ p_3/P,\ \dots$ the proportions of the population between the various age limits, but $d_1, d_2, d_3, \dots$ are supposed unknown. Now if the population in the country as a whole were the same in corresponding age groups as it is in the district under consideration, we should get as the death rate for the whole country

$$(p_1\delta_1+p_2\delta_2+p_3\delta_3+\dots)/P,$$

where $\delta_1, \delta_2, \delta_3, \dots$ are the death rates in the various age groups in the country at large, and these would in practice as a rule be known. The actual death rate for the whole country is, however,

$$(q_1\delta_1+q_2\delta_2+q_3\delta_3+\dots)/Q,$$

where $q_1/Q,\ q_2/Q,\ q_3/Q,\ \dots$ denote, as before, the real proportions of the population in the various age groups in the country at large. We take as the corrected death rate required for the district a number bearing to the crude death rate the same ratio as $(q_1\delta_1+q_2\delta_2+\dots)/Q$ bears to $(p_1\delta_1+p_2\delta_2+\dots)/P$.
Hence we have

$$\text{corrected death rate}=D\times\frac{q_1\delta_1+q_2\delta_2+\dots}{p_1\delta_1+p_2\delta_2+\dots}\times\frac{P}{Q}.$$

Index Numbers to compare Household Budgets. Another highly important illustration of a weighted mean occurs in the search for a satisfactory measure of the change in the cost of living from year to year. We have already introduced the subject of variation in wholesale prices, and we have seen that Sauerbeck, in forming his index numbers, treats as one each of the forty-five commodities he uses to measure this variation: the observations, that is to say, are not weighted. But, confining our attention to food alone, supposing we have five items, such as bacon, bread, tea, sugar, milk, for which the index numbers of prices at two different dates are:

| | Bacon | Bread | Tea | Sugar | Milk |
|---|---|---|---|---|---|
| First date | 100 | 100 | 100 | 100 | 100 |
| Second date | 117 | 95 | 94 | 102 | 109 |

Is it really right to treat each of these items as of equal importance with the rest, or ought we to regard bread and tea, say, as of more weight than bacon, and count bread perhaps five times and tea three times while counting bacon only once? It is clear that, in order to select a reasonable set of multipliers in this case, we should need to know the standard of living of the class of people under consideration, and how much in the aggregate they spend upon bacon and how much upon bread, etc. A partial answer to these questions can be obtained by making a collection of household budgets as was done, for example, by two Government Committees which recently reported (1918-19) on the Cost of Living among the Urban and the Agricultural Working Classes respectively. If the number of commodities employed is large, even an arbitrary set of multipliers, as we have indicated, will not displace the mean any great distance from the value when reasonable weights are chosen, but unfortunately in collecting such household budgets we are confined to the comparatively limited variety of food-stuffs which are in general use.
Different principles may be followed in making the comparison between one year and another which may be illustrated by a few figures from the Urban Classes Report (1918):

Table (10). Household Budgets showing Prices of each Commodity and Quantities Purchased at Two Different Dates by Typical Family.

| Commodity | First year (1914): $x_1$, price (pence per lb.) | First year (1914): $n_1$, no. of lb. bought | Second year (1918): $x_2$, price (pence per lb.) | Second year (1918): $n_2$, no. of lb. bought |
|---|---|---|---|---|
| Sugar | 2.2 | 5.9 | 7.07 | 2.83 |
| Tea | 21.3 | 0.68 | 33.3 | 0.57 |
| Potatoes | 0.7 | 15.6 | 1.25 | 20.0 |

Let $x_1$ be the price, in pence per unit, of any one commodity at the first date, and let $n_1$ be the number of units of this commodity bought per week by a typical family (n may be estimated in different ways, e.g. (1) by dividing the total number of units bought by all families by the total number of those families, or (2) by ranging the different amounts bought by different families in order of magnitude and picking out the median amount, or (3) by choosing the mode, i.e. the amount most commonly purchased). Also let $x_2$ be the price, in pence per unit, of the same commodity at the second date, and let $n_2$ be the number of units of the commodity then bought per week by the typical family estimated in the same way as before. The actual expenditure, measured in pence, at the two dates will then be $\Sigma(x_1n_1)$ and $\Sigma(x_2n_2)$ respectively, where $\Sigma(x_1n_1)$ simply denotes the sum of expressions like $(x_1n_1)$ for all the commodities recorded and $\Sigma(x_2n_2)$ denotes the sum of expressions like $(x_2n_2)$ for all the commodities recorded, $\Sigma$, the Greek capital S, being a well-known conventional abbreviation for 'Sum of expressions like.' Thus, with the numbers in Table (10), we should have

$$\Sigma(x_1n_1)=(2.2)(5.9)+(21.3)(0.68)+(0.7)(15.6)+\dots$$

$$\Sigma(x_2n_2)=(7.07)(2.83)+(33.3)(0.57)+(1.25)(20.0)+\dots$$
Taking 100 as the index number to represent expenditure at the first date, the index number measuring expenditure at the second date may be formed in any of the following different ways,* which as a rule, of course, lead to different results:

(1) $100\,\Sigma(x_2n_2)/\Sigma(x_1n_1)$;
(2) $100\,\Sigma(x_2n_1)/\Sigma(x_1n_1)$ or $100\,\Sigma(x_2n_2)/\Sigma(x_1n_2)$;
(3) $100\,\Sigma(x_1n_2)/\Sigma(x_1n_1)$ or $100\,\Sigma(x_2n_2)/\Sigma(x_2n_1)$.

The first of these expressions compares the actual expenditure at the second date to that at the first date.

The next two expressions take into account directly only the change in prices; they compare, not actual expenditures, but the expenditures at the two dates as they would be if the amounts purchased at the two dates were the same: the first supposing these amounts to equal those actually bought at the first date, and the second supposing them to equal those actually bought at the second date.

The last two expressions, on the other hand, take into account directly only the change in amounts purchased; they compare the expenditures at the two dates as they would be if the prices ruling at the two dates were the same: the first supposing these prices to equal those actually charged at the first date, and the second supposing them to equal those actually charged at the second date.

The particular method of weighting adopted must naturally depend upon the circumstances of the period under discussion and the nature of the inquiry one is making; it is a nice question to decide how far emphasis should be laid upon the old standard of life (measured by food, lighting, rent, recreation, etc.) with the expense required to maintain it, and upon the new standard of life and the cost necessary to reach it.

It may be useful here to summarize a few of the questions of interest which present themselves in connection with the formation of index numbers of prices designed to measure changes in the value of money in general without reference to any particular class of the community:

1.
What years should be selected in fixing our standard prices?
2. What commodities should be chosen as a basis for our average?
3. What weight should be given to each commodity in relation to the rest?
4. How should the prices of the several commodities be determined, bearing in mind that 'price' itself frequently varies from place to place?
5. Finally, how should these prices be combined to give the average required? Should we use the simple arithmetic mean, the geometric mean [see Appendix, Note 3], the median, or some other measure?

[* See also The Measurement of Changes in the Cost of Living, by A. L. Bowley, Sc.D., in the Journal of the Royal Statistical Society, May 1919, for a more complete discussion of the subject.]

While we are not prepared to attempt to answer these questions fully, seeing that authorities are not altogether agreed as to what the answers should be, one or two points may be worth noting. Generally speaking we may say that:

1. The years selected in fixing our standard prices should be years in which economic conditions were normal rather than abnormal.
2. The commodities chosen should be articles of general consumption, and as wide a field as possible should be covered in their choice.
3. Many consider that little is gained by weighting, but, if weights are introduced, the greater the importance of any commodity in relation to the rest, judged for example by the relative quantity consumed, the greater should be the weight assigned to it.
4. The practical difficulty of assessing retail prices when they are uncontrolled compels us in general to fall back upon wholesale quotations, on which some light may be thrown by keeping under observation the important markets for the sale of each commodity.
5. The average commonly used is the simple arithmetic or the weighted mean, though arguments can be adduced in favour of other averages such as the median.
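The five ways of forming a budget index number described above may be compared on the figures for sugar, tea, and potatoes quoted from the Urban Classes Report. The sketch is confined to those three commodities, so the results are illustrative only and are not the indices of the Report; the variable and function names are our own.

```python
# Prices in pence per lb. (x) and quantities in lb. per week (n), Table (10).
x1 = {"sugar": 2.2, "tea": 21.3, "potatoes": 0.7}     # prices, 1914
n1 = {"sugar": 5.9, "tea": 0.68, "potatoes": 15.6}    # quantities, 1914
x2 = {"sugar": 7.07, "tea": 33.3, "potatoes": 1.25}   # prices, 1918
n2 = {"sugar": 2.83, "tea": 0.57, "potatoes": 20.0}   # quantities, 1918


def expenditure(prices, quantities):
    """Sum of price * quantity over the commodities recorded."""
    return sum(prices[item] * quantities[item] for item in prices)


def index(second, first):
    """Index number at the second date, the first date being taken as 100."""
    return 100 * second / first


actual = index(expenditure(x2, n2), expenditure(x1, n1))              # (1)
price_old_budget = index(expenditure(x2, n1), expenditure(x1, n1))    # (2), first form
price_new_budget = index(expenditure(x2, n2), expenditure(x1, n2))    # (2), second form
amount_old_prices = index(expenditure(x1, n2), expenditure(x1, n1))   # (3), first form
amount_new_prices = index(expenditure(x2, n2), expenditure(x2, n1))   # (3), second form

for label, value in (("actual expenditure", actual),
                     ("prices, 1914 budget", price_old_budget),
                     ("prices, 1918 budget", price_new_budget),
                     ("amounts, 1914 prices", amount_old_prices),
                     ("amounts, 1918 prices", amount_new_prices)):
    print(label, round(value, 1))
```

On these three commodities alone the two price indices differ appreciably according to which year's budget supplies the weights, which illustrates why the choice between the old and the new standard of life is not a matter of indifference.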
Leaving index numbers now on one side and returning to the general subject of averages, we may remark that the question which average is correct in any given case, the mean (weighted or otherwise), the median, or the mode, does not arise: no one average is more correct than another, because they are all entirely conventional and represent different ideas; they correspond in fact to so many different ways of summing up a set of observations or measurements in a single numerical statement, and the real question to determine is which statement, which kind of average, brings the set of observations before us to the best focus.

For this purpose one average will clearly be best in one case and another in another, but it may be stated without hesitation that the arithmetic mean is certainly the most useful of the three and it is the most frequently used. Other averages have been suggested, such as the geometric and the harmonic means [see Appendix, Note 3] familiar to students of Algebra, as suitable in special classes of problems.*

In a reasonably symmetrical distribution of observations, one in which the variables of medium size are the most frequent and the frequency diminishes about equally on either side towards the largest and the least of the variables, the values of the mean, the median, and the mode will be found to lie all very close together; and a useful practical rule to remember is that the median comes in general between the mean and the mode, the difference between the mean and the mode being about three times the difference between the mean and the median. This rule, for lack of a better, might be used to determine the mode in suitable cases, or it might be used to test the value found in some other way.

The general term 'average' is frequently used when the particular denomination 'arithmetic mean' is implied, but the context will usually prevent misunderstanding.
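The practical rule connecting the three averages can be written as a one-line function. The sketch below is an approximation only, as the text warns, and the function name is our own; the figures used to try it are the mean and median wage quoted for the women tailors in the exercise at the end of this chapter.

```python
def mode_from_mean_and_median(mean_value, median_value):
    """Approximate mode from the rule (mean - mode) = 3 * (mean - median)."""
    return mean_value - 3 * (mean_value - median_value)


# Mean 13.08s. and median 12.53s., as quoted for the women tailors.
estimated_mode = mode_from_mean_and_median(13.08, 12.53)
print(round(estimated_mode, 2))   # 11.43

# In a perfectly symmetrical distribution mean and median coincide,
# and the rule then places the mode at the same point.
print(mode_from_mean_and_median(25.0, 25.0))
```

It will be seen that the median falls between the mean and the estimated mode, as the rule requires.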
In order to get a clear impression of the outstanding features presented by the three chief averages discussed, let us go over them once more in the case of marks awarded to a number of students in a class. All three may be regarded as in a sense measures of the standard reached by the class as a whole in the examination, but the measures are made in different ways:

1. The Arithmetic Mean is found by merely dividing the aggregate marks of the class by the number of the students, and it gives the marks earned by each student if we conceive them all to be of equal merit.
2. The Median is found by ranging the students in order of merit from top to bottom, and picking out the marks awarded to the one who comes half-way down the list.
3. The Mode is the most fashionable number of marks, i.e. the marks obtained by the greatest number of candidates.

The advantages and disadvantages of the three types may be set out broadly as follows, although the boundary lines must not be too strictly drawn:

Mean.                          Median.                        Mode.

Easy to calculate when the     Easy to pick out when the      Not easy to determine with
values of the variable can     individuals can be ranged in   precision, when the
be summed and their number     order according to the value   observations fall into groups
is known.                      or degree of the variable      of different ranges, without
                               observed.                      fitting a frequency curve to
                                                              the distribution as a whole.

Well designed for algebraical  Unsuited for algebraical work. Unsuited for algebraical work.
manipulation, as, for example,
when we wish to combine
different sets of observations
[see Appendix, Note 4, for two
illustrations].

Affected sometimes too much by Determined merely by its       Unaffected by abnormal
abnormal individuals among the position in the distribution,  individuals, and owes its
observations.                  and its actual value is thus   importance to the fact that it
                               quite unaffected by abnormal   is located in the region where
                               individuals.                   the frequency is most dense.

* See Note on p. 41.

The reader should test his grasp of the principles so far introduced by applying them himself to a concrete case. For example, he might use the data in Table (11), with regard to wages earned by certain women, taken from Tawney's Minimum Wages in the Tailoring Trade, and based upon the 1906 Wages Census. Let him begin by roughly estimating the mean, the median, and the mode from an inspection of the distribution. He might then proceed to calculate the mean wage:

(1) taking the actual frequencies given in the table;
(2) taking simple sub-multiples of these frequencies, roughly one-hundredth part of each: 2, 4, 6, 7, 9, 11, etc.;
(3) assuming unit frequency in place of that given in the table for each wage group.

Finally, he might determine the median and the mode in the manner explained in the text, deducing the latter from the relation (mean - mode) = 3(mean - median).

The results obtained should be (1) 13.08s.; (2) 13.10s.; (3) 15.59s. Median = 12.53s.; Mode = 11.43s.

Table (11). Distribution of Wages of certain Women Tailors.

Wages between limits         No. of Women earning wages as shown
 5s. and less than  6s.        180
 6s.  "    "    "   7s.        384
 7s.  "    "    "   8s.        553
 8s.  "    "    "   9s.        690
 9s.  "    "    "  10s.        900
10s.  "    "    "  11s.       1145
11s.  "    "    "  12s.       1201
12s.  "    "    "  13s.       1138
13s.  "    "    "  14s.        930
14s.  "    "    "  15s.        885
15s.  "    "    "  16s.        790
16s.  "    "    "  17s.        642
17s.  "    "    "  18s.        453
18s.  "    "    "  19s.        401
19s.  "    "    "  20s.        272
20s.  "    "    "  21s.        251
21s.  "    "    "  22s.        138
22s.  "    "    "  23s.        124
23s.  "    "    "  24s.         64
24s.  "    "    "  25s.         54
25s.  "    "    "  30s.        122

* [One important example of the use of the geometric mean is in the construction of the Board of Trade Index Number of Wholesale Prices: see article by A. W. Flux, C.B., M.A., in the Journal of the Royal Statistical Society, March 1921.]
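The exercise on Table (11) can be checked with a short Python sketch. It is only an illustration under stated assumptions: every observation in a class is placed at the mid-value of the class, the median is found by linear interpolation, and for method (2) the frequencies are simply rounded to the nearest hundred, which is one reading of "roughly one-hundredth part of each". The function and variable names are hypothetical and are not taken from the text.

```python
# Table (11): (lower limit, upper limit, number of women), wages in shillings.
WAGE_CLASSES = [
    (5, 6, 180), (6, 7, 384), (7, 8, 553), (8, 9, 690), (9, 10, 900),
    (10, 11, 1145), (11, 12, 1201), (12, 13, 1138), (13, 14, 930),
    (14, 15, 885), (15, 16, 790), (16, 17, 642), (17, 18, 453),
    (18, 19, 401), (19, 20, 272), (20, 21, 251), (21, 22, 138),
    (22, 23, 124), (23, 24, 64), (24, 25, 54), (25, 30, 122),
]


def grouped_mean(classes):
    """Mean of the class mid-values, each weighted by its frequency."""
    total = sum(freq for _, _, freq in classes)
    return sum((lo + hi) / 2 * freq for lo, hi, freq in classes) / total


def grouped_median(classes):
    """Median by linear interpolation inside the class that holds rank N/2."""
    half = sum(freq for _, _, freq in classes) / 2
    below = 0
    for lo, hi, freq in classes:
        if below + freq >= half:
            return lo + (hi - lo) * (half - below) / freq
        below += freq
    raise ValueError("no observations")


# (1) actual frequencies; (2) frequencies rounded to hundreds (assumption);
# (3) unit frequency in every class.
mean_actual = grouped_mean(WAGE_CLASSES)
mean_hundredths = grouped_mean(
    [(lo, hi, round(f / 100)) for lo, hi, f in WAGE_CLASSES]
)
mean_unit = grouped_mean([(lo, hi, 1) for lo, hi, _ in WAGE_CLASSES])

median = grouped_median(WAGE_CLASSES)
# Mode deduced from the relation (mean - mode) = 3(mean - median).
mode = mean_actual - 3 * (mean_actual - median)

print(f"(1) {mean_actual:.2f}s.  (2) {mean_hundredths:.2f}s.  (3) {mean_unit:.2f}s.")
print(f"median {median:.2f}s.  mode {mode:.2f}s.")
```

The three means differ because methods (2) and (3) alter the weights attached to the classes; method (3) ignores the frequencies altogether and so is pulled towards the long upper tail of the wage scale.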
CHAPTER VI

DISPERSION OR VARIABILITY

Let us suppose that two men set out separately on walking tours and that they walk as follows:

The total distance covered in six days, namely 150 miles, and therefore also the mean rate of walking, 25 miles a day, are thus exactly the same in both cases, but the dispersion of the values of the variable (the variable being in this instance the number of miles walked per day) round about their mean value, the variability, is different in the two cases. The greatest deviation from the average in the first case is five and in the second case it is ten miles.

Thus, besides knowing the average of a set of values of a variable it is important to measure the dispersion of the distribution. Are the observations crowded in a dense mass around the average, or do they tail off above and below it, and to what extent? In other words, what is the variability from the average of the distribution?

Mean Deviation. Now we are not concerned here with the signs of the separate deviations, with the question, that is, whether any particular value of the variable lies above or below the average: it is only of their amount we wish to take cognizance, and perhaps the most obvious way to measure the total variability and at the same time to ignore the signs of the separate deviations from the average is to add up these deviations, treating them all as signless, and to divide the result by their total number. This gives what is known as the mean deviation of the system of observations. It is the ordinary arithmetic mean of the separate deviations, treated as if they are all in the same direction, and, in measuring them, we may use either the mean or the median as the average, but it would seem preferable to take the latter because the mean deviation is least when the median is chosen as the origin, or zero point, from which the differences are measured.
The proof of this fact will be found in Note 6 in the Appendix, but we may readily test it in a given case. Let us adapt the 'walking' illustration used above, slightly extending the figures and making them unsymmetrical, i.e. of unequal variability on either side of the average, so as to prevent the median coinciding with the mean. We then have an amended table setting out the number of miles walked by a certain man on successive days during, say, a fortnight's tour, as follows:

Table (12). Number of Miles walked on Successive Days.

(1)      (2)      (3)         (4)          (5)         (6)         (7)            (8)
f        Miles    x           x1           Deviation   Deviation   fx             fx1
No. of   walked   Deviation   Deviation    from 24     from 26     [Col. (1)] x   [Col. (1)] x
days              from 25     from 24.64                           [Col. (3)]     [Col. (4)]

 1       10       15          14.64        14          16          15             14.64
 2       15       10           9.64         9          11          20             19.28
 3       20        5           4.64         4           6          15             13.92
 3       25       ..           0.36         1           1          ..              1.08
 2       30        5           5.36         6           4          10             10.72
 2       35       10          10.36        11           9          20             20.72
 1       40       15          15.36        16          14          15             15.36

14       ..       ..          ..           ..          ..          95             95.72

The first two columns show that 10 miles was the distance walked on the first day, 15 miles on each of the next two days, 20 miles on each of the next three days, and so on until the last day, when 40 miles was the distance walked.

The median in this case, being the number of miles walked on the middle day when the days are ranged in order of mileage from the least to the greatest, is 25, for this is the distance covered on both the seventh and the eighth days which come half-way along the series.

Col. (3) shows the deviations from the median, 25, of the distances covered each day as recorded in col. (2), and col. (7) enables us to sum these deviations when each is multiplied by the number of days to which it corresponds, since these numbers, given in col. (1), show how many times each deviation is repeated. Hence the mean deviation, regardless of sign, measured from the median
= [(1 × 15) + (2 × 10) + (3 × 5) + (2 × 5) + (2 × 10) + (1 × 15)]/14
= (15 + 20 + 15 + 10 + 20 + 15)/14
= 95/14
= 6.79 miles.
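The calculation just made, and the corresponding figures for other origins, can be checked with a minimal Python sketch. The dictionary DAYS and the function mean_deviation are illustrative names only; the data are those of cols. (1) and (2) of Table (12).

```python
# Table (12): miles walked in a day -> number of days on which that
# distance was walked (cols. (2) and (1) of the table).
DAYS = {10: 1, 15: 2, 20: 3, 25: 3, 30: 2, 35: 2, 40: 1}


def mean_deviation(freq, origin):
    """Mean of the deviations from `origin`, all treated as signless."""
    n = sum(freq.values())
    return sum(f * abs(x - origin) for x, f in freq.items()) / n


n_days = sum(DAYS.values())                                 # 14
arithmetic_mean = sum(x * f for x, f in DAYS.items()) / n_days  # 345/14

md_median = mean_deviation(DAYS, 25)            # origin at the median
md_mean = mean_deviation(DAYS, arithmetic_mean)
md_24 = mean_deviation(DAYS, 24)
md_26 = mean_deviation(DAYS, 26)

for label, value in [("median 25", md_median), ("mean", md_mean),
                     ("24", md_24), ("26", md_26)]:
    print(f"mean deviation from {label}: {value:.2f} miles")
```

With these data the value measured from the median is the smallest of the four, in agreement with the minimum property stated in the text.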
We may compare this with the corresponding deviations measured from (1) the arithmetic mean, (2) the number 24, and (3) the number 26 as origin respectively.

1. The arithmetic mean of the distribution is obtained at once by multiplying the corresponding numbers in cols. (1) and (2), adding the results, and dividing the total by 14, thus
Arithmetic mean = [1(10) + 2(15) + 3(20) + 3(25) + 2(30) + 2(35) + 1(40)]/(1 + 2 + 3 + 3 + 2 + 2 + 1)
= (10 + 30 + 60 + 75 + 60 + 70 + 40)/14
= 345/14
= 24.64 miles,
and the deviations from 24.64 are shown in col. (4); the mean deviation from 24.64, obtained by combining cols. (1) and (4) and adding as shown in col. (8)
= [1(14.64) + 2(9.64) + . . . ]/14
= 95.72/14
= 6.84 miles.

2. Similarly, the mean deviation from 24, making use of col. (5),
= [1(14) + 2(9) + . . . ]/14
= 6.93 miles.

3. And the mean deviation from 26, making use of col. (6),
= [1(16) + 2(11) + . . . ]/14
= 7.07 miles.

The original determination gives a value which is less than any of these three results, as was anticipated.

The mean deviation from the median is, however, difficult to calculate with exactness when the observations are recorded in groups between different limits: for this and other reasons we shall not spend much time upon it, and we shall as a rule choose the mean as origin of reference rather than the median. It may be as well to explain the source of the difficulty by a small hypothetical illustration.

Let us suppose that in making measurements of some organ or character in 13 individuals we get a result lying between 4 and 6 units on six occasions, between 6 and 8 units on four occasions, and between 8 and 10 units on three occasions. Here, assuming that all the individuals in any group have the mid-value measurement for that group, i.e.
treating the distribution as one of 6 individuals with a variable measuring 5 units, 4 individuals with a variable measuring 7 units, and 3 individuals with a variable measuring 9 units, we get 18/13 as the mean deviation with 7 as origin and 18.5/13 for the mean deviation with 6.5 as origin, as the following table shows:

Measurement.   Frequency.   Deviation   Deviation    fx      fy
               f            from 7.     from 6.5.
                            x           y

5              6            2           1.5          12      9
7              4            ..          0.5          ..      2
9              3            2           2.5           6      7.5

..             13           ..          ..           18      18.5

Now the result obtained is in agreement with the minimum mean deviation theory, granted that 7 is the median measurement, as it might certainly be. But it is not so of necessity, and in that case the assumption italicized might lead, in the above calculation, to appreciable inaccuracy unless the number of observations is large and the class-interval is small. For example, the actual distribution might, without contradicting the previous data, conceivably run:

Measurement.   Frequency.   Deviation   Deviation    fx'     fy'
               f'           from 7.     from 6.5.
                            x'          y'

5              6            2           1.5          12      9
6.5            2            0.5         ..            1      ..
7.5            2            0.5         1             1      2
9              3            2           2.5           6      7.5

..             13           ..          ..           20      18.5

But in this case the median, the measurement for the seventh individual from either end of the series, is 6.5, and according to the first calculation the mean deviation referred to 6.5 as origin appears to be greater than that referred to 7 as origin. If, however, we recalculate, using the more detailed table, we find that the mean deviation referred to 6.5 as origin (18.5/13) is really less than the mean deviation with reference to 7 as origin, as it should be, for the latter now turns out to be 20/13.

Standard Deviation. An alternative method of avoiding the signs of the deviations from the average in order to estimate the amount of variability of the distribution is to square each separate deviation, sum the squares, divide by their number, and take the square root of the result. This gives the root-mean-square deviation, and it is least when the arithmetic mean of the variables is chosen as origin from which to measure the deviations, when it is known as the standard deviation.
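The root-mean-square rule just stated can be sketched in Python and tried on the walking data of Table (12). This is a hedged illustration only: DAYS and rms_deviation are hypothetical names, and the four origins tried (25, the mean, 24 and 26) are chosen to match those used for the mean deviation.

```python
import math

# Table (12): miles walked in a day -> number of days at that distance.
DAYS = {10: 1, 15: 2, 20: 3, 25: 3, 30: 2, 35: 2, 40: 1}


def rms_deviation(freq, origin):
    """Square each deviation from `origin`, average the squares, take the root."""
    n = sum(freq.values())
    mean_square = sum(f * (x - origin) ** 2 for x, f in freq.items()) / n
    return math.sqrt(mean_square)


n_days = sum(DAYS.values())
arithmetic_mean = sum(x * f for x, f in DAYS.items()) / n_days  # 345/14

rms_25 = rms_deviation(DAYS, 25)
rms_mean = rms_deviation(DAYS, arithmetic_mean)  # the standard deviation
rms_24 = rms_deviation(DAYS, 24)
rms_26 = rms_deviation(DAYS, 26)

for label, value in [("25", rms_25), ("mean", rms_mean),
                     ("24", rms_24), ("26", rms_26)]:
    print(f"root-mean-square deviation from {label}: {value:.3f}")
```

The value referred to the mean comes out smallest, which is the minimum property that gives the standard deviation its name and its preferred place.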
For proof of this minimum principle see Appendix, Note 5, but it is worth while testing it also with the data given in Table (12).

The numbers in cols. (3) to (6) in Table (13) are obtained simply by squaring the corresponding numbers in the same cols. (3) to (6) in Table (12). Col. (7) is formed in order to enable us to calculate the mean-square deviation referred to 25 as origin; the numbers in col. (3) show the squares of the deviations for each individual observation, and the numbers in col. (1), by which they are multiplied, show how frequently the same values are repeated. Hence we get the mean-square deviation with reference to 25
= [1(225) + 2(100) + 3(25) + 2(25) + 2(100) + 1(225)]/14
= 975/14
= 69.64.

Thus the root-mean-square deviation referred to 25
= √(69.64)
= 8.345.

Similarly, by means of col. (8), formed on exactly the same principle, we find that the root-mean-square deviation referred to 24.64 as origin
= √[(214.33 + 185.86 + . . . )/14]
= √(973.22/14)
= 8.338.

But 24.64 is the mean of the distribution, hence 8.338 is the standard deviation.

With the help of cols. (5) and (6) the student may himself calculate the root-mean-square deviation with regard to 24 and 26 respectively as origin; the results should be 8.36 and 8.45. Of the four values thus obtained for the root-mean-square deviation, the least is that referred to the mean as origin, the standard deviation, now proposed as a measure of variability or dispersion suitable for most general purposes.

This measure possesses several decided advantages over the mean deviation; among others it lends itself more easily to certain algebraical processes (see, for example, p.
158), a fact of importance when we wish, for instance, to discuss two sets of observations in combination, and it is in general less affected by 'fluctuations of sampling', errors which arise owing to the fact that we cannot as a rule survey the whole field of operations, but have to be content with a sample.

Table (13). Number of Miles walked on Successive Days.

(1)      (2)      (3)         (4)          (5)         (6)         (7)            (8)
f        Miles    x²          x1²          Square of   Square of   fx²            fx1²
No. of   walked   Square of   Square of    Deviation   Deviation   [Col. (1)] x   [Col. (1)] x
days              Deviation   Deviation    from 24     from 26     [Col. (3)]     [Col. (4)]
                  from 25     from 24.64

 1       10       225         214.33       196         256         225            214.33
 2       15       100          92.93        81         121         200            185.86
 3       20        25          21.53        16          36          75             64.59
 3       25        ..           0.13         1           1          ..              0.39
 2       30        25          28.73        36          16          50             57.46
 2       35       100         107.33       121          81         200            214.66
 1       40       225         235.93       256         196         225            235.93

14       ..        ..          ..           ..          ..         975            973.22

Quartile Deviation or Semi-interquartile Range. There is a third measure of dispersion, based upon the determination of the quartiles, and to introduce them we may refer again to Table (7) in order to show how the idea of the median may be extended.

We define the individual occupying a position one-quarter the way along any series of observations, arranged in ascending order of magnitude of some organ or character common to all the individuals of the series, as the lower quartile; and we define the individual occupying a position three-quarters the way along the series as the upper quartile.

When the distribution of observations is divided up into groups lying between different limits of the variable under consideration the quartiles may, like the median, be calculated by interpolation. Thus, in the examination example, the total number of candidates is 514 and ¼(514) = 128.5.

But the 91st candidate from the bottom gets approximately 20 marks, and the 149th candidate from the bottom gets approximately 25 marks. Hence the imaginary candidate, No.
128.5, should get a number of marks lying somewhere between 20 and 25.

But if, in this neighbourhood,
a difference of (149 - 91) candidates corresponds to a difference of 5 marks,
a difference of (128.5 - 91) candidates should correspond to a difference of 5 × 37.5/58 marks.

Thus, the marks assigned to the lower quartile candidate are approximately
20 + 5 × 37.5/58 = 20 + 3.23.
Hence the lower quartile = 23.23.

Again ¾(514) = 385.5.

But the 318th candidate from the bottom gets approximately 35 marks, and the 397th candidate from the bottom gets approximately 40 marks.

Therefore, the imaginary candidate, No. 385.5, should get approximately a number of marks
= 35 + 5 × 67.5/79 = 39.27.
Hence the upper quartile = 39.27.

It is clear that the quartiles together with the median divide the whole series of observations into approximately four equal groups, so that the quartile marks give a rough idea of the distribution on either side of the average.

[Diagram: a line marked at the lower quartile Q = 23.23, the median 31.52, and the upper quartile Q' = 39.27.]

For this reason half the difference between the quartiles provides a convenient measure of the dispersion, and it is called the quartile deviation or semi-interquartile range; thus, if Q be the lower and Q' the upper quartile, we have the
quartile deviation = ½(Q' - Q).
In the above example, this measure
= ½(39.27 - 23.23)
= ½(16.04)
= 8.02.

If a more minute analysis of the distribution of variables is desired, we may range them in order of magnitude as before, and divide up the series into ten equal parts, recording every tenth along the line; these tenths are called deciles. Thus, the deciles in the examination example correspond to the marks assigned to imaginary candidates numbered as follows:
51.4, 102.8, 154.2, 205.6, 257.0, 308.4, 359.8, 411.2, 462.6,
and they can be calculated by the interpolation method used in finding the median and quartiles.

This way of representing the chief features of a distribution, by quartiles, etc., was much used by Galton in his researches and writings.
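The interpolation used above for the two quartiles can be written as one small Python function. The sketch below is illustrative only; the function name and its parameters are hypothetical, and the only figures used are the cumulative counts quoted in the text (91 candidates up to 20 marks, 149 up to 25, 318 up to 35, 397 up to 40).

```python
def interpolate_mark(rank, lower_mark, width, cum_below, cum_through):
    """Mark of the imaginary candidate `rank`, by simple proportion.

    lower_mark   marks reached by candidate number `cum_below`
    width        number of marks spanned by the class
    cum_through  number of the candidate who reaches lower_mark + width
    """
    return lower_mark + width * (rank - cum_below) / (cum_through - cum_below)


n_candidates = 514

# Lower quartile: candidate No. 128.5 lies between the 91st and 149th.
lower_quartile = interpolate_mark(n_candidates / 4, 20, 5, 91, 149)

# Upper quartile: candidate No. 385.5 lies between the 318th and 397th.
upper_quartile = interpolate_mark(3 * n_candidates / 4, 35, 5, 318, 397)

quartile_deviation = (upper_quartile - lower_quartile) / 2

print(f"Q = {lower_quartile:.2f}, Q' = {upper_quartile:.2f}, "
      f"quartile deviation = {quartile_deviation:.2f}")
```

The same function would serve for the deciles, given the cumulative counts on either side of each of the ranks 51.4, 102.8, and so on.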
The student may be perplexed as to which should be used of so many different measures of dispersion or variability, but there need be no real confusion. If a rough estimate only is wanted the quartile deviation is a convenient measure, assuming that the variables observed or measured can be ranged in order of magnitude so as to admit of the quartiles being readily picked out. Also the measure thus obtained is not unsatisfactory when the distribution of values of the variable is fairly symmetrical and uniform in its gradation from greatest frequency to least. If, however, it is conspicuously skew (unsymmetrical) and there are erratic differences in frequency between successive values of the variable, it is better to choose a measure which gives the magnitude and the position of each recorded observation its due weight in the deviation sum.

Then again the choice as between the standard deviation and the mean deviation may be sometimes determined by the particular kind of average which suits the problem best. But as the arithmetic mean is the most important and the most commonly used average, so the standard deviation is certainly the most important measure of dispersion.

It will be shown later that the following relations are approximately true when the distribution of variables is not very far from being symmetrical:
(1) Quartile deviation = 2/3 (Standard deviation).
(2) Mean deviation = 4/5 (Standard deviation).
In (2) the mean deviation should be measured from the mean. Also
(3) a range of two or three times the standard deviation on both sides of the mean will be found to include the majority of the observations in the distribution.

Coefficient of Variation. Before we pass on to illustrate the subject of averages and variability by means of a few examples it is necessary to introduce one more constant known as the coefficient of variation.
It is a measure of variability but it differs from the chief measures already discussed in that they are absolute measures, whereas the coefficient of variation, written C. of V. for short, is a ratio or relative measure. The need for it arises when we reflect that in order to gauge fairly the amount of variability we ought to have in mind also the size of the mean from which the variation is measured; just as a difference of 1 foot between the heights of two men is a conspicuous difference when the normal height is between 5 and 6 feet, whereas the same difference of 1 foot between two measured miles would be trifling because the standard mile contains over 5000 feet.

The coefficient of variation has been defined by Karl Pearson (Phil. Trans., vol. 187A, p. 277), who first suggested its use, as 'the percentage variation in the mean, the standard deviation (S.D.) being treated as the total variation in the mean,' so that
C. of V. = 100 S.D./Mean.

He pointed out that it would be idle, in dealing with the variation of men and women (or indeed very often of the two sexes of any animal), to compare the absolute variation of the larger male organ directly with that of the smaller female organ, because several of these organs, as well as the height, the weight, brain capacity, etc., are greater in man than in woman in the approximate proportion of 13 : 12.

As an example of the use of the C. of V., figures may be quoted from a paper by R. Pearl and F. J. Dunbar (Biometrika, vol. ii. pp. 321 et seq.), On Variation and Correlation in Arcella. Measurements in mikrons were made of the outer and inner diameters of 504 specimens of a shelled rhizopod belonging to the group Imperforata, family Arcellina, with the following results, to two decimal places:

                   Mean.     S.D.     C. of V.
Outer diameter     55.79     5.73     10.27 per cent.
Inner diameter     15.91     2.17     13.66 per cent.

Thus, judging by the S.D.
column, giving the absolute size of deviation, the outer diameter would appear to be more variable than the inner, but the C. of V. column shows that, if we take the sizes of the two diameters into account, the inner is really the more variable of the two. To turn aside the edge of possible criticism it should be added that the authors also give the errors to which the above measures are subject, as unless these are known we cannot tell whether the differences observed in variation are significant or not of a real difference in fact, but that question must be left until the theory of errors due to sampling has been developed in a later chapter.

The C. of V. varies considerably for different characters. W. R. Macdonell states that '3 to 5.5 are representative values for variability in man, while in plants it may run to 40,' and Pearson and others have shown that for stature in man it varies from about 3 to 4 and for the length of long bones from 4 to 6.

CHAPTER VII

FREQUENCY DISTRIBUTION: EXAMPLES TO ILLUSTRATE CALCULATING AND PLOTTING: SKEWNESS

Calculation of Mean and Standard Deviation. Example (1). We return now to the examination example in order to show how the labour of calculation in finding the arithmetic mean and standard deviation of a frequency distribution may be somewhat lessened.

The various steps in the process appear in Table (14). In the first column the marks at the middle of each class-interval have been written down, and we make the assumption that all the candidates in any one class have the same number of marks, namely, the marks at the middle of the class-interval. In any case where the number of observations is large, and where the class-intervals are reasonably small, the errors resulting from such an assumption will be insignificant, because the individuals in each class are just as likely to have values above as below the value at the middle of the class-interval, and they will therefore compensate for one another.
We now seek to alter the scale of marking so as to produce a simpler set of marks than the original, which will make the work of finding the mean also simpler, but we must not forget at the end to change back again to the original scale.

We choose a number from col. (1), somewhere near the required mean, to act as a kind of origin from which to measure the other numbers in the column. This choice is only a rough guess, and it is really immaterial which number is selected as origin, except that the nearer it is to the mean the lighter will be the calculation to follow; the number 33 has been selected in this instance.

In col. (2) are written down the deviations of the marks in each class from 33, so that now some candidates appear as if they were 5, 10, 15 . . . marks to the bad, and others as if they were 5, 10, 15 . . . to the good. So long as we remember to add 33 at the end we can content ourselves therefore by finding the mean of the marks as given in col. (2). But these again can be further simplified by dividing each candidate's marks by 5, and we then only need to find the mean of the marks as shown in col. (3), so long as we remember to multiply by 5 at the first step back to the old scale of marking.

The addition of col. (5) makes it easy to calculate this mean, for it gives the result of multiplying each value of the variable (the number of marks in each class) by its appropriate weight (the number of candidates who obtained that number of marks).

Table (14). Marks obtained by 514 Candidates in a certain Examination (Analysis of Method for Calculating Mean and Standard Deviation).

Thus, on this new scale, the mean marks obtained are
[5(-6) + 9(-5) + 28(-4) + . . . + 87(0) + . . . + 6(+5) + 3(+6)]/514
= (-532 + 422)/514
= -110/514
= -0.214.

This, then, is the mean of the marks obtained by the candidates on the scale indicated in col. (3). If the marks are on the scale given in col. (2), the mean is 5(-0.214), i.e. -1.070.
To bring them back to the original scale as in col. (1) we must add 33 to this result, so that the required arithmetic mean
= 33 + 5(-0.214)
= 33 - 1.070
= 31.93.

To find the Standard Deviation, or the root-mean-square deviation from the arithmetic mean, it is convenient as before to work with the simplified scale, to measure the deviations from the arbitrary origin (33) associated with that scale, and to make the necessary corrections at the end of the work.

Col. (5) in Table (14) gives the deviation multiplied by the frequency in each class, the frequency denoting the number of times the particular deviation occurs. Hence, if these numbers be multiplied again by the numbers in col. (3), we shall have each separate deviation squared and multiplied by its frequency. The results are shown in col. (6), and they must be added, and their sum divided by the sum of the frequencies (514), to give the mean-square deviation, which we may represent by s². Thus
s² = 2814/514 = 5.475,
and this is the mean-square deviation referred to 33 as origin. We require the corresponding expression referred to the mean, 31.93, as origin. If we denote this by s_m², there is a simple relation connecting the two, namely,
s_m² = s² - x̄²,
where x̄ is the deviation of the mean itself from 33 [see Appendix, Note 5]; of course s_m, s, and x̄ are all to be measured on the same scale, the simplified scale adopted with 5 marks as unit.

Now we have already shown that the deviation of the mean from 33 = -0.214, and this is therefore the value of x̄. Hence
s_m² = 5.475 - (-0.214)²
= 5.475 - 0.046
= 5.429,
so that s_m = 2.33.

And, returning to the old scale, the standard deviation, usually denoted by σ,
= 5(2.33)
= 11.65.

We notice that 3σ = 34.95, and this range on either side of the mean amply takes in all the observations.

The mean deviation is readily found from Table (14) by adding up the numbers in col. (5) regardless of sign and dividing by the sum of frequencies, 514.
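Before the mean deviation is worked out, the mean and standard deviation just obtained can be checked from the column totals alone. The Python sketch below is a minimal illustration assuming only the figures quoted in the text: 514 candidates, origin 33, unit 5 marks, a col. (5) total of -110 and a col. (6) total of 2814. All variable names are hypothetical.

```python
import math

N = 514          # total frequency
ORIGIN = 33      # arbitrary origin chosen from col. (1)
UNIT = 5         # class-interval used as the unit of the simplified scale
SUM_FD = -110    # total of col. (5): frequency x deviation
SUM_FD2 = 2814   # total of col. (6): frequency x deviation squared

# Mean on the simplified scale, then back on the original scale of marks.
x_bar = SUM_FD / N
mean_marks = ORIGIN + UNIT * x_bar

# Mean-square deviation about the origin 33, then about the mean,
# using the relation s_m^2 = s^2 - x_bar^2.
s_squared = SUM_FD2 / N
s_m_squared = s_squared - x_bar ** 2

# Standard deviation on the original scale.
sigma = UNIT * math.sqrt(s_m_squared)

print(f"mean = {mean_marks:.2f} marks, standard deviation = {sigma:.2f} marks")
```

Because the correction x̄² is subtracted, a poor guess at the origin costs nothing in accuracy; it merely makes the numbers in cols. (5) and (6) larger and the arithmetic heavier.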
Thus, on the new scale, the mean deviation
= 954/514
= 1.856,
which, on the old scale, becomes 5(1.856) or 9.28.

This, however, is the mean deviation measured from 33 as origin, and a correction has to be applied to get the mean deviation measured from the median or from the mean.

To get the mean deviation from the mean we note that the difference between the mean, 31.93, and 33 is 1.07. Hence it should be clear from Table (14) that, by measuring from 33 instead of from 31.93, we have made the deviations of all the marks from 33 upwards too little by 1.07, and we have made the deviations of all the marks from 28 downwards too much by 1.07. Hence, to get the deviation required we must add to 9.28 an amount
= (1/514)[1.07(87 + 79 + . . . + 3) - 1.07(82 + 58 + . . . + 5)]
= (1.07/514)(283 - 231)
= (1.07/514) × 52
= 0.108.

Therefore, the mean deviation measured from the mean = 9.39. This may be compared with
4/5 (standard deviation) = 9.32.

Also the quartile deviation for this distribution has been shown to be = 8.02, and it may be compared with
2/3 (standard deviation) = 7.77.

Plotting of a Frequency Distribution. The data for the two examples which follow are taken from the Quarterly Return of Marriages, Births, and Deaths, No. 261, issued by the Registrar-General.

The first shows the proportion to population of cases of infectious disease notified in 241 large towns of England and Wales for the thirteen weeks ended 4th April 1914. This proportion was given for each town separately in the Return, but, in order to bring out the distinctive features of the distribution, the several towns have been, in Table (15), represented by dots and put into different classes according to the proportion of infectious cases notified in each, with a separate line for each class: e.g. if the proportion for any town was 5.37 a dot was placed in the line corresponding to the class of towns for which the rate was '4 and less than 6.' Every fifth dot in each line was ticked off, so as to make them easy to count up and also to keep the lines, down the paper as well as across, straight. The frequency, i.e. the number of dots in each class, was then recorded in a column at the extreme right-hand side of the paper.

Table (15). Proportion to Population of Cases of Infectious Disease notified in 241 Large Towns of England and Wales during the Thirteen Weeks ended 4th April 1914.

Case Rate per 1000        Total No. of Towns
persons living.           with given Rate.
 0 and less than  2          5
 2  "    "    "   4         39
 4  "    "    "   6         69
 6  "    "    "   8         41
 8  "    "    "  10         29
10  "    "    "  12         22
12  "    "    "  14         16
14  "    "    "  16          7
16  "    "    "  18          5
18  "    "    "  20          3
20  "    "    "  22          4
22  "    "    "  24          0
24  "    "    "  26          0
26  "    "    "  28          1

                           241

[Fig. (1). Dot diagram. Horizontal axis Ox: Rate of Disease per 1000 persons living.]

It will be at once seen that this procedure, without calculating any averages, etc., ultimately gives to the eye a very good picture of the distribution, and indeed it is the basis of the graphical method of studying statistics. In drawing a proper graph we use a specially ruled sheet of paper which is divided up into a large number of equal small squares by 'horizontal' (cross) and 'vertical' (up-and-down) lines. This merely enables us to place our dots accurately in position, as shown in fig. (1), where the numbers 0, 5, 10 . . .
have been marked off along the line Ox to correspond to 'case rates' of these magnitudes: thus rates of '4 and less than 6' were recorded by 69 successive dots along a vertical line at a distance 5 (the centre of the class-interval 4-6) from the axis Oy.

[Fig. (2). Topmost dots of each class joined up, with the Modal Line marked. Horizontal axis Ox: Rate of Disease per 1000 persons living.]

The final configuration in fig. (1), when turned half round, is exactly the same as that of Table (15). If desired the frequency may be recorded, dot by dot, on a side piece of paper and then only the topmost dot in each class need be marked on the graph sheet. In order, however, to enable the eye to measure the height of each frequency in relation to the rest, it is advisable in that case to connect up adjacent dots as in fig. (2) or as in fig. (3).

[Fig. (3). Histogram, with the Modal Line, Median Line, and Mean Line marked. Horizontal axis Ox: Rate of Disease per 1000 persons living.]

The last method of representation (fig. (3)), to which the name histogram has been given by Professor Karl Pearson, is particularly useful and should be carefully studied. It is formed in this case by erecting a succession of rectangles with the lines 02, 24, 46 . . . along Ox as their bases, corresponding to the successive classes of the given distribution, and with heights proportional to the frequencies proper to those classes.
It is not necessary to complete the sides of the rectangles, but, if they were completed, each would enclose a number of squares proportional to the frequency of towns with the rate of disease defined by its base: e.g. the first rectangle would enclose 10 squares, the second 78, the third 138, and so on, numbers respectively proportional to 5, 39, 69, and so on. It follows that the total area enclosed between the histogram and the axis Ox is proportional to the aggregate frequency of towns observed.

Now we might conceive a step further taken and a smoothed curve drawn freehand so as to agree as closely as possible with fig. (2) or fig. (3), but with all the sharp corners smoothed out, and so nicely adjusted as to make the area enclosed between the curve, the axis Ox, and lines parallel to Oy defining the limits of any class, proportional to the frequency of towns in that class. To this fig. (2) and fig. (3) might be regarded as approximating if only a sufficient number of observations were recorded, and only in that case would it be possible to draw it with any accuracy. Such a curve is called a frequency curve, measuring as it does the frequency of the observations in different classes.

[Assuming that corresponding to a given frequency distribution a curve of this kind does really exist (and the assumption turns upon the frequency being continuous) the reader who is acquainted with the notation of the Calculus will recognise that, if (x, y) represents any point on the curve, y δx measures the frequency of observations or measurements of an organ or character lying between the values x and (x + δx), when the total frequency comprises a large number of observations, say 500 to 1000.

Further, it will appear later that the mean, the median, and the mode have a geometrical interpretation of no small importance associated with the curve.
The mean x corresponds to the particular ordinate y which passes through the centroid or centre of gravity of the area between the frequency curve and axis Ox, because

$$\text{the mean} = \lim_{\delta x \to 0} \sum (x \cdot y\,\delta x) \Big/ \lim_{\delta x \to 0} \sum (y\,\delta x),$$

where the summation extends throughout the distribution,

$$= \int xy\,dx \Big/ \int y\,dx,$$

where the integral extends throughout the curve.

The median x corresponds to the ordinate y which bisects this same area; e.g. in fig. (3), the number of small squares on either side of the median in the space bounded by the histogram and the axis represents half the total number of observations, two small squares corresponding to each observation.

The mode x corresponds to the maximum ordinate of the curve, measuring the greatest frequency in the whole distribution.]

Skewness. There is one feature of a frequency distribution which catches the eye sooner almost than any other, and that is its symmetry or lack of symmetry. It is important therefore that we should have some means of measuring it. In a symmetrical distribution the mean, mode, and median coincide, and we have, as it were, a perfect balance between the frequency of observations on either side of the mode or ordinate of maximum frequency. In a skew distribution the centre of gravity is displaced and the balance thrown to one side: the amount of this displacement measures the skewness. But there is another factor to be taken into account, for when the variability of the distribution is great the balance is more sensitive than when it is small, and the difference between mean and mode is consequently more pronounced though it may not be significant of any greater skewness. This will be clear in the light of the analogy of the swing of a pendulum.
If OPP' denote the pendulum in the accompanying figure, OAA' its mean position, and OBB' an extreme position, the displacement in the position OPP' from the mean, if measured along the scale AB, is AP, and, if measured along the scale A'B', is A'P'. But, since the amount of swing in either case is the same, it would be more appropriate to write the linear displacement as a fraction of the full swing so as to make these two measures also the same, thus

AP/AB = A'P'/A'B'.

So, in the case of a frequency distribution, Professor Karl Pearson has suggested as a suitable measure for skewness, not the difference between mean and mode, but the ratio of this difference to the variability. Thus

skewness = (mean − mode)/S.D.

or, approximately,

= 3(mean − median)/S.D. (see p. 39),

a form which is sometimes useful.

According to this convention the skewness is regarded as positive when the mean is greater than the mode, and as negative when the mode is greater than the mean.

[Figure: two skew frequency curves with the positions of mode and mean marked, x increasing to the right.]

Illustrations of frequency curves, with the position of mode and mean marked, will be found in Chapter XVII. We proceed to the detailed calculations necessary in the infectious diseases example.

Table (16). Proportion to Population of Cases of Infectious Disease notified in 241 Large Towns of England and Wales during the Thirteen Weeks ended 4th April 1914.

(1)                      (2)          (3)             (4)             (5)
Case Rate per 1000       Deviation    Frequency of    Product of      Product of
persons living.          from 7.      Towns with      Nos. in Cols.   Nos. in Cols.
                                      given Rate.     (2) & (3).      (2) & (4).
                         (x)          (f)             (fx)            (fx²)
 0 and less than  2       − 3           5             −15               45
 2  „    „    „   4       − 2          39             −78              156
 4  „    „    „   6       − 1          69             −69               69
 6  „    „    „   8         0          41               0                0
 8  „    „    „  10       + 1          29             +29               29
10  „    „    „  12       + 2          22             +44               88
12  „    „    „  14       + 3          16             +48              144
14  „    „    „  16       + 4           7             +28              112
16  „    „    „  18       + 5           5             +25              125
18  „    „    „  20       + 6           3             +18              108
20  „    „    „  22       + 7           4             +28              196
26  „    „    „  28       +10           1             +10              100
                                      241             +68             1172

Example (2).
The various averages and measures of variability of the distribution can be calculated just as in the case of the last example, and the data required to determine the mean and the standard deviation are set out in Table (16). We can afford now to miss out some of the more obvious steps in explanation.

On the scale of col. (2), where a difference of 2 in the case rate, per 1000 persons living, is the unit and where a case rate of 7 is taken as origin, the mean, by the result of col. (4),

= 68/241 = 0.282.

Hence, on the original scale, the mean

= 7 + 2(0.282) = 7.564.

Again, the mean-square deviation, on the scale of col. (2), measured from 7 as origin is

1172/241 = 4.863;

and x̄, the deviation of the mean from 7 as origin, on the scale of col. (2) = 0.282. Thus the mean-square deviation measured from the mean

= 4.863 − (0.282)² = 4.783.

Therefore, the standard deviation σ, on the original scale

= 2√4.783 = 4.374.

Since 3σ = 13.122, the range '(mean − 3σ) to (mean + 3σ)' includes all but one or two observations.

To determine the median, we conceive the towns ranged in order according to the proportion of infectious cases notified in each, from the least to the greatest, and the town with the median rate is the 121st from either end. But the 113th town has a notified case rate of approximately 6 per 1000, and the 154th town has a notified case rate of approximately 8 per 1000. Thus a difference of 41 towns corresponds to a difference of 2 in the rate, hence a difference of 8 towns corresponds to a difference of 0.39 in the rate; therefore the median rate = 6.39 approximately.

By referring to the original records and writing down the rate for each town in the group 'rate 6 and less than 8' in which the median lay, the accurate value of the median turned out to be 6.30.

The lower quartile or case rate of the imaginary town, No.
¼(241), or 60.25, one-quarter way along the ordered sequence of towns, is readily shown to be 4.47, and the upper quartile or case rate of town No. ¾(241), or 180.75, is 9.84. Hence the quartile deviation

= ½(9.84 − 4.47) = 2.69.

With this may be compared ⅔(S.D.) = ⅔(4.37) = 2.92.

Again, the mean deviation measured from 7

= 2(392/241) = 3.253,

where 392 is the sum of the numbers in col. (4) taken without regard to sign. Measured from the mean, it becomes

= 3.253 + (0.564/241)[(41 + 69 + 39 + 5) − (29 + 22 + 16 + 7 + 5 + 3 + 4 + 1)]
= 3.253 + (0.564)(67)/241
= 3.41,

and this may be compared with ⅘(S.D.) = ⅘(4.374) = 3.50.

If we estimate the mode by inspection of the frequency graphs in figs. (2) and (3), we should say it comes between 5 and 6; supposing we call it 5.5, very roughly. In this case, taking the values actually calculated for mean and median,

(mean − mode) = 7.56 − 5.50 = 2.06,

and

3(mean − median) = 3(7.56 − 6.39) = 3(1.17) = 3.51;

so that the rule

(mean − mode) = 3(mean − median)

is far from being true according to these results; this is partly due, of course, to the very unsymmetrical character of the distribution.

The relative positions of the mean, median, and modal points as calculated are indicated in figs. (2) and (3) by three lines drawn parallel to Oy through these points to meet the graph.

Finally, skewness = (mean − mode)/S.D. = 2.06/4.37 = 0.47.

Example (3). The next example deals with the deaths of infants under one year, out of every thousand born, in 100 great towns in the United Kingdom during the thirteen weeks ended 4th April 1914. The details of the calculation may be left in this case to the reader, who is recommended to follow the method shown in the last example so far as possible throughout, including the plotting of the distribution in different ways. The statistics are as follows:

Table (17). Death Rate of Infants under 1 Year per 1000 Births.

(1)                  (2)                 (3)                  (4)
Death Rate.          No. of Towns        Death Rate.          No. of Towns
                     with Death Rate                          with Death Rate
                     as in Col. (1).                          as in Col. (3).
 30 and under  40     1                  120 and under 130    16
 50  „    „    60     3                  130  „    „   140    11
 60  „    „    70     2                  140  „    „   150    10
 70  „    „    80     6                  150  „    „   160     8
 80  „    „    90     7                  160  „    „   170     3
 90  „    „   100     6                  170  „    „   180     1
100  „    „   110    11                  200  „    „   210     1
110  „    „   120    13                  240  „    „   250     1

The more important results are: arithmetic mean = 118.9; S.D. = 32.2; median = 120.9; quartile deviation = 19.5.

Example (4). As another example corresponding details may be worked out for the following temperature records taken at noon at a certain spot in Chester week by week during a period of time covering five years, the results in this case being: mean = 55.10; S.D. = 10.33; median = 54.88; quartile deviation = 7.94.

Table (18). 257 Weekly Records of Temperature (Fahrenheit).

(1)                  (2)                 (3)                  (4)
Temperature          No. of Records      Temperature          No. of Records
Limits in            between Limits      Limits in            between Limits
Degrees.             shown in Col. (1).  Degrees.             shown in Col. (3).
25.5-29.5             1                  53.5-57.5            30.5
29.5-33.5             1                  57.5-61.5            31.5
33.5-37.5             9                  61.5-65.5            30
37.5-41.5            11.5                65.5-69.5            26
41.5-45.5            28                  69.5-73.5            13.5
45.5-49.5            31.5                73.5-77.5             4
49.5-53.5            36.5                77.5-81.5             3

Before closing the chapter a slightly different manner of graphing the statistics is worth noticing, as it provides us with a fairly quick though rough alternative method of determining the mode and median.

Take, for example, the examination marks data which for this purpose must first be thrown into the second form shown below Table (7). We mark off on some convenient scale along OX distances 5, 10, 15, 20 ... 65 from O to represent these numbers of marks respectively, and at the points obtained we erect lines parallel to OY of lengths 5, 14, 42, 91 ... 514 to represent the numbers of candidates who obtained not more than 5, 10, 15, 20 ... 65 marks respectively. A freehand curve is then drawn through the summits of these lines in the manner indicated in fig. (4), starting from a height 5 and rising to a height 514 above the axis OX. It is called an ogive curve.
By means of this curve we can approximately state at once how many candidates obtained any given number of marks or less. Suppose, for example, we wish to know how many candidates obtained 22 marks or less, we have only to measure off a distance 22 from O, represented by ON, and erect a perpendicular NP to meet the curve at P. Since NP = 110 we infer from the manner in which the curve has been formed that 110 candidates obtained 22 marks or less, so that, incidentally, the 110th candidate from the bottom must have obtained approximately 22 marks.

This suggests that by working backwards we can also read off roughly the number of marks gained by any particular candidate when his order in the list is known. Thus, to find the median, i.e. the marks due to candidate No. 257.5, we merely draw a line parallel to OX at a height 257.5 above it and the portion of this line cut off between the curve and OY measures the median. The value given by this method is approximately 31.5. Similarly the quartiles are found by drawing lines parallel to OX at heights 128.5 and 385.5 above it with results about 23.3 and 39.2 respectively.

Again, as we gradually increase the number of marks, the number of candidates getting that number of marks or less must increase also, but the rate of this second increase is variable. The reader will perceive that where the height above OX changes slowly the gradient of the curve is small, but where it changes by big steps the gradient is steep, and it is at its steepest just in the neighbourhood where the greatest addition is being made to the height as the marks increase, i.e. where the frequency of additional candidates is at its greatest, so determining the mode: this should be clear on a comparison of the two arrangements of the data in and below Table (7).
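Reading a median or quartile off an ogive amounts to interpolating in the cumulative frequencies. The sketch below does this numerically for the infectious-disease data of Table (16), since the marks of Table (7) are not reproduced in this section; it also repeats the short-method mean and standard deviation of Example (2). Straight-line interpolation within each class stands in for the freehand curve, and the function name `value_at_rank` is purely illustrative, so this is a check on the arithmetic rather than the author's graphical procedure.

```python
from math import sqrt

# Table (16): lower class limits (each class is 2 wide) and town frequencies.
lower = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 26]
freq = [5, 39, 69, 41, 29, 22, 16, 7, 5, 3, 4, 1]
width = 2
n = sum(freq)  # 241 towns

# Short method: deviations of class centres from the origin 7,
# measured in units of one class interval (col. (2) of the table).
origin = 7
dev = [(lo + width / 2 - origin) / width for lo in lower]
m1 = sum(f * d for f, d in zip(freq, dev)) / n        # 68/241
m2 = sum(f * d * d for f, d in zip(freq, dev)) / n    # 1172/241
mean = origin + width * m1
sd = width * sqrt(m2 - m1 ** 2)


def value_at_rank(rank):
    """Rate of the town with the given rank, by linear interpolation in its class."""
    below = 0  # cumulative frequency of the classes already passed
    for lo, f in zip(lower, freq):
        if below + f >= rank:
            return lo + width * (rank - below) / f
        below += f
    raise ValueError("rank exceeds the total frequency")


median = value_at_rank(121)      # the 121st town, as in the text
q1 = value_at_rank(n / 4)        # town No. 60.25
q3 = value_at_rank(3 * n / 4)    # town No. 180.75

print(f"mean = {mean:.3f}, S.D. = {sd:.3f}")
print(f"median = {median:.2f}, quartiles = {q1:.2f} and {q3:.2f}")
```

The figures agree with those worked by hand in Example (2): the interpolation is the same proportional argument ("a difference of 41 towns corresponds to a difference of 2 in the rate") applied by rank.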
By sliding a straight-edge along the contour of the curve we can estimate approximately where the curve is steepest, for at this point the direction of turning of the ruler or straight-edge must change. This gives for the mode a value in the neighbourhood of 32. It might be advisable to treat the other examples by this method also, so as to compare results.

[Fig. (4). Graph showing the number of candidates who obtained not more than any given number of marks, with the lower quartile, median and upper quartile lines marked. Horizontal axis: number of marks.]

CHAPTER VIII

GRAPHS

From the mathematical point of view graphs may be regarded as the alphabet of Algebraical Geometry. We can locate a point in a plane, relative to two perpendicular lines or axes as they are called, OX, OY, which serve as boundaries of measurement, when we know y and x, its shortest distances from these boundaries. This fact serves to connect up Geometry, in which points are elements, with Algebra, in which x's and y's, standing always for numbers, are elements. The names abscissa (ab, from, and scindo, I cut) and ordinate are given to x and y, or, when we refer to them together, they may be spoken of as the co-ordinates of P.

The celebrated French philosopher, Descartes (1596-1650), was the founder of Cartesian Geometry, and if we may venture to compress the essence of his system into a single statement, it is this: when a point P is free to take up any position in a given plane, its x and y are quite independent; they may be allotted any values irrespective of one another.
Suppose, however, that P is constrained to lie somewhere on an assigned curve, such as APB in the figure, then x and y are no longer independent, for, so soon as x is fixed, y is fixed also; it follows that in this case some relation, algebraical or otherwise, such as

$y = x^2 - 2x + 1$,

must exist between x and y, and the relation may be called the equation of the curve which gives rise to it.

Now, if to every curve there corresponds in this way some equation and to every equation some curve, it seems likely that the simpler the curve the simpler will be the corresponding equation, and vice versa. In fact, the student who does not know it already need only refer to the most elementary treatise on graphs to find that every equation of the first degree in x and y, i.e. one which does not involve any $x^2$, $y^2$, $xy$, or higher powers, represents some straight line. Any such equation, e.g.

$x - 3y + 12 = 0$,

can be at once thrown into either the form

(1) $\dfrac{x}{-12} + \dfrac{y}{4} = 1$,

where −12 and 4 are intercepts made by the line on the axes OX and OY; or

(2) $y = \tfrac{1}{3}x + 4$,

where ⅓, i.e. 1 in 3, is the measure of its gradient and 4 the height above the origin at which it cuts the axis OY.

Further, every equation of the second degree in x and y, which may involve $x^2$, $y^2$, and $xy$, but no higher powers, represents geometrically some conic, a family of curves comprising the parabola, the ellipse, and the hyperbola, with the circle and two straight lines as particular cases. The earth and other planets, likewise comets, in their journeys through space travel along curves belonging to the same family, one of ancient and historical connections.

These conics need not, however, detain us, and we pass on at once to an example of a cubic graph to show how a very little knowledge of the theory may be put to some practical use.

Suppose a box manufacturer has a large number of rectangular sheets of cardboard, 3 ft. long by 2 ft.
broad, and he wishes to make open boxes with them by cutting a square piece of the same size out of each corner and turning up the flaps that are left. How big should the squares be if this is to be done with as little waste as possible? Clearly this is commercially an important type of problem to solve.

Let us denote a side of the square to be cut out of each corner by x feet. Then the bottom of the required box will have dimensions (3 − 2x) ft. by (2 − 2x) ft. and its depth will be x ft.

[Figure: the sheet, 3 ft. long, with a square of side x cut from each corner. The shaded flaps are bent upwards along the dotted lines.]

Hence the capacity of the box when completed will be

x(3 − 2x)(2 − 2x) cu. ft.,

and he makes best use of the material who produces the most capacious box. Call this expression y and let us find the values of y corresponding to different values of x so as to be able to draw roughly the curve of which the equation is

y = x(3 − 2x)(2 − 2x) . . . (1)

Table (19). Table of Corresponding Values of x and y in the Curve y = x(3 − 2x)(2 − 2x).

  x       2x      (3−2x)   (2−2x)   x(3−2x)(2−2x)           y
 −1      −2        5        4       −20                   −20
 −½      −1        4        3       −6                     −6
 −¼      −½        3½       2½      −35/16                 −2.19
  0       0        3        2        0                      0
 +¼      +½        2½       1½      +15/16                 +0.94
 +½      +1        2        1       +1                     +1
 +¾      +1½       1½       ½       +9/16                  +0.56
 +1      +2        1        0        0                      0
 +1¼     +2½       ½       −½       −5/16                  −0.31
 +1½     +3        0       −1        0                      0
 +2      +4       −1       −2       +4                     +4
 +2½     +5       −2       −3       +15                   +15

  0.2     0.4      2.6      1.6     (0.2)(2.6)(1.6)         0.83
  0.4     0.8      2.2      1.2     (0.4)(2.2)(1.2)         1.06
  0.6     1.2      1.8      0.8     (0.6)(1.8)(0.8)         0.86
  0.8     1.6      1.4      0.4     (0.8)(1.4)(0.4)         0.45

  0.38    0.76     2.24     1.24    (0.38)(2.24)(1.24)      1.055
  0.39    0.78     2.22     1.22    (0.39)(2.22)(1.22)      1.056
  0.40    0.80     2.20     1.20    (0.40)(2.20)(1.20)      1.056
  0.41    0.82     2.18     1.18    (0.41)(2.18)(1.18)      1.055

We get a tolerably good idea of the shape of the curve by plotting the points (x, y) shown in Table (19) from x = −½ to x = +2 as in fig. (5). It is simply a matter of practice to be able to determine the whole curve from a few points in this way, and the greater the number of points plotted the more accurately will it be possible to draw the curve.
It should be noticed that the points for which y = 0 are in a sense key-points to the curve: they are readily found by making the factors separately zero in the right-hand side of equation (1), namely x = 0, 3 − 2x = 0, and 2 − 2x = 0, and by plotting them first they serve as a guide to the position of points subsequently plotted.

[Fig. (5). Graph of y = x(3 − 2x)(2 − 2x), with an inset on a larger scale between x = 0 and x = 1. Horizontal axis: length of side of square cut out.]

We want to know for what value of x the capacity of the box, y, is greatest and the preliminary plotting is enough to indicate a maximum value for y between x = 0 and x = 1, for the curve first rises and then falls between these two limits. In order to discover more exactly where the maximum is located we therefore plot in addition the points corresponding to x = 0.2, 0.4, 0.6, 0.8 respectively, and this is done on a larger scale than that used in the first diagram because the accuracy is thereby increased (see fig. (5) inset). The calculations and figure suggest that the maximum required is very near the point for which x = 0.4, so we next work out values of y in this neighbourhood, corresponding, say, to x = 0.38, 0.39, 0.40, 0.41, with the results shown at the foot of Table (19).

From these we conclude that to a fair degree of accuracy the maximum value of y is given by taking x = 0.395. It would be possible in the same way to calculate more decimal places, but we have gone far enough to make the method clear. Hence the side of each square cut out should be of length 0.395 ft., or 4¾ in.

Whenever the value of one variable, y, depends upon that of another variable, x, in such a way that when x is given y is known, so that y may be termed a function of x, corresponding values of x and y can be plotted, as was done in the example just discussed, and a curve drawn by joining up the points obtained, the relation which connects x and y being the equation of this curve.
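The successive refinement used in the box example (tabulate, find where y peaks, tabulate again on a finer scale) is easy to mechanise. The sketch below repeats the last rows of Table (19) and then searches a grid of step 0.001 between x = 0 and x = 1. The calculus cross-check at the end, which sets the derivative 12x² − 20x + 6 to zero, is an addition not used in the text, and the names `volume`, `x_best` and `x_exact` are illustrative only.

```python
from math import sqrt


def volume(x):
    """Capacity in cu. ft. of the box when squares of side x ft. are cut out."""
    return x * (3 - 2 * x) * (2 - 2 * x)


# Reproduce the rows at the foot of Table (19).
for x in (0.38, 0.39, 0.40, 0.41):
    print(f"x = {x:.2f}   y = {volume(x):.3f}")

# Finer tabulation: step 0.001 between 0 and 1, keeping the x with the largest y.
grid = [i / 1000 for i in range(0, 1001)]
x_best = max(grid, key=volume)

# Cross-check (an assumption beyond the text): y = 4x^3 - 10x^2 + 6x,
# so dy/dx = 12x^2 - 20x + 6, whose smaller root is the maximum.
x_exact = (20 - sqrt(20 ** 2 - 4 * 12 * 6)) / (2 * 12)

print(f"grid maximum near x = {x_best:.3f}, y = {volume(x_best):.4f}")
print(f"root of dy/dx = 0 at x = {x_exact:.4f}")
```

Both approaches place the maximum a little above x = 0.39. Since y is flat to three decimal places between 0.39 and 0.40, this is consistent with the value 0.395 read from the table.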
Moreover, it is possible, by calculating enough points from the equation and plotting them, to get the curve as accurately as we please.

In Statistics, however, we usually have to start the other way round and reach the equation, if at all, last. We make observations of two sets of variables, a set of x's and a set of y's, one of which is dependent in some way upon the other (e.g. y, the dependent variable, might denote the number of individuals observed to have a certain organ of length x, the independent variable), and thus we get pairs of corresponding values like $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$ ...

We met with examples of this method of recording results in the last chapter, and we need only repeat here that its chief virtue is suggested in the root of the word itself: it is more graphic than a long table of figures and, by means of it, many of the essential features of a problem are immediately seized upon.

Now for some purposes it may be necessary to go further and to find what curve would best fit the points plotted, assuming they were numerous enough, and what equation between x and y would best describe the curve. But the graphs we meet in Statistics, bearing, for instance, upon sociological or biological problems, are in general much more wayward than the mathematical kind we have referred to in the present chapter: it is impossible to set down simple equations to which they can be rigidly confined, and when we are unable to find any relation which accurately and uniquely defines y as a function of x we must rest satisfied with the most manageable equation and the best fit we can get.

In sciences such as Engineering and Physics it is often possible to fix upon two mutually dependent variables, x and y, and to observe enough corresponding values of each to enable us to draw a graph which answers very closely to the true relationship between them, so that a connecting equation can be determined; e.g.
we may plot the amount of elastic stretch, y, in a wire when different weights, x, are hung from the end of it, and it is found that y is directly proportional to x. If we deal in this way with some simple figures which are amenable to our purpose it may help to make clear the nature of the same problem in Statistics.

The following corresponding values of x and y were given in a Board of Education Examination (1911):

x = 1.00, 1.50, 2.00, 2.30, 2.50, 2.70, 2.80;
y = 0.77, 1.05, 1.50, 1.77, 2.03, 2.25, 2.42.

Allowing for errors of observation, it was desired to test if there was a relation between y and x of the type

$y = a + bx^2$ . . . (1)

In the first place, the shape of the curve obtained by plotting y against x, as in fig. (6), would, to the initiated, probably suggest a parabola, the equation of which is of type (1). In order to test its suitability we proceed to plot y against $x^2$, or, putting $x^2 = \xi$, we plot y against ξ. If equation (1) holds, then, in that case

$y = a + b\xi$ . . . (2)

should also hold, and this, in (ξ, y) co-ordinates, represents a straight line. The result of plotting y against ξ should therefore be a number of points approximately in a straight line; we say 'approximately' to allow for errors of observation in the original data.

Now from the given statistics corresponding values of ξ and y are, since $\xi = x^2$:

ξ = 1.00, 2.25, 4.00, 5.29, 6.25, 7.29, 7.84;
y = 0.77, 1.05, 1.50, 1.77, 2.03, 2.25, 2.42;

and the resulting graph, fig. (7), is very approximately a straight line.

[Fig. (6). Graph of y plotted against x.]

[Fig. (7). Graph of y plotted against ξ.]

To determine its equation, choose two points (not too close together) on the line, which has been drawn so as to run as fairly as possible through the middle of the points plotted, and, in choosing, take points which lie at the intersections of horizontal and vertical cross lines (the printed lines of the graph paper) if such can be found, because their x's and y's can be read off with ease and accuracy. Two such points are

(2.8, 1.2) and (6.0, 2.0),

and since each of these points lies on the line whose equation is

$y = a + b\xi$,

we have

1.2 = a + b(2.8),
2.0 = a + b(6.0).

Subtracting, we get 0.8 = b(3.2). Therefore b = ¼. Hence a = 2 − 3/2 = ½. Thus the equation of the line is

$y = \tfrac{1}{4}\xi + \tfrac{1}{2}$, i.e. $4y = \xi + 2$,

and the law connecting x and y is therefore

$4y = x^2 + 2$.

The following statistics, the result of an experiment in Physics to verify Boyle's Law, may be treated in the same way. x is a number proportional to the volume of a constant weight of gas in a closed space, and y is a number proportional to its absolute pressure. Corresponding values of x and y observed were:

x =  46.89   41.96   40.33   38.88   37.37   36.06   34.71   33.47
y =  76.32   85.38   88.93   92.36   96.09   99.61  103.51  107.51

x =  32.39   31.08   29.97   28.76   27.26   25.32   24.04
y = 111.09  115.69  120.05  125.08  131.99  142.09  149.81

Boyle's Law states that the product xy is constant, and this may be tested by putting ξ = 1/x and plotting y against ξ; the points obtained should be approximately in a straight line.

Now in Statistics, as we have already explained, the exact connection between the variables, x and y, is rarely so clear, though the absence of law is not so complete as it might seem at first sight. At this stage, however, we need not enter into the difficult question of curve fitting: if drawn with care and used with judgment much that is of value may be learnt by simple plotting and by connecting up the resulting points by straight lines or a freehand curve.

We shall briefly explain or illustrate by examples how graphs and graphical ideas may be used to serve three distinct purposes, namely:

(1) to suggest correlation or connection between two different factors or events;
(2) to supply a basis for finding by interpolation some values of a variable when others are known;
(3) as pictorial arguments appealing to the reason through the eye.
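As a numerical companion to the Boyle's Law readings quoted earlier in this chapter, the sketch below forms the product xy for each pair and reports how far the products stray from their average. It is a rough check only, not the graphical test the text recommends; two readings (41.96 and 142.09) are repaired from a damaged scan and should be treated as assumptions, and the variable names are illustrative.

```python
# Volume (x) and absolute pressure (y) readings, in proportional units.
# 41.96 and 142.09 are restored from garbled print and are assumptions.
x = [46.89, 41.96, 40.33, 38.88, 37.37, 36.06, 34.71, 33.47,
     32.39, 31.08, 29.97, 28.76, 27.26, 25.32, 24.04]
y = [76.32, 85.38, 88.93, 92.36, 96.09, 99.61, 103.51, 107.51,
     111.09, 115.69, 120.05, 125.08, 131.99, 142.09, 149.81]

# Boyle's Law: the product xy should be (nearly) the same for every pair.
products = [vol * pres for vol, pres in zip(x, y)]
mean_product = sum(products) / len(products)
relative_spread = (max(products) - min(products)) / mean_product

# Equivalent view: y against xi = 1/x is a line through the origin, y = k * xi.
# The least-squares slope for such a line is sum(xi * y) / sum(xi^2).
xi = [1 / vol for vol in x]
k = sum(a * b for a, b in zip(xi, y)) / sum(a * a for a in xi)

print(f"mean of xy       = {mean_product:.1f}")
print(f"relative spread  = {relative_spread:.2%}")
print(f"slope of y on xi = {k:.1f}")
```

The products differ from one another by well under one per cent, which is the numerical counterpart of the plotted points lying approximately in a straight line.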
We reserve (2) and (3) for the next chapter and proceed at present with an example of (1).

Correlation suggested by Graphical means. Consider the index numbers, col. (2) Table (20), showing the variation from year to year in wholesale prices between the years 1871 and 1912. It is not an easy matter to take in satisfactorily the meaning of such a mass of bare figures, but they are much easier to grasp when plotted in a graph.

In this case the numbers x, representing years, and the numbers y, representing prices, are measures of things of quite a different character, so that it is not necessary to take the x and y units of the same size. Moreover they need not, in a case of this kind, necessarily vanish at the origin, but it is convenient to draw the graph in such a way that it shall occupy the greater part of the space at our disposal.

Thus, we have roughly 80 small squares across the breadth of our graph paper, and between 1871 and 1912 we have roughly 40 years; we therefore take two sides of a square to 1 year and mark off the years 1870, 1875, 1880, ..., along an axis or base line parallel to the breadth of the paper, as shown in fig. (8).

Again we have roughly 70 small squares in the available space from this base line to the top of our graph paper, and the wholesale price index numbers vary from 88.2 to 151.9, a range of 63.7; we therefore take one side of a square to correspond to a difference of 1 in the price index number, and mark off the prices 90, 100, 110, ..., along an axis parallel to the length of the paper, as shown in the figure.

We then plot points to represent the numbers in col. (2) of Table (20). Thus, in 1880 wholesale prices stood at 129; we therefore travel along the width of the paper till we reach 1880 and then upwards until we are opposite the 129 level on the axis of prices, inserting a dot to mark the position. Similarly for all other points, and the required graph is given by joining them up in succession.
Table (20). Marriage Rate and Wholesale Prices Index Numbers.

(1)     (2)       (3)           (4)             (5)         (6)            (7)
Year.   Prices.   Nine Years'   Difference      Marriage    Nine Years'    Difference
                  Average of    between Nos.    rate.       Average of     between Nos.
                  Prices.       in Cols.                    Marriage       in Cols.
                                (2) & (3).                  rate.          (5) & (6).
1871    135.6      ..            ..             167          ..             ..
1872    145.2      ..            ..             174          ..             ..
1873    151.9      ..            ..             176          ..             ..
1874    146.9      ..            ..             170          ..             ..
1875    140.4     139.3         +1.1            167         164            + 3
1876    137.1     138.6         −1.5            165         162            + 3
1877    140.4     136.5         +3.9            157         159            − 2
1878    131.1     133.8         −2.7            152         157            − 5
1879    125.0     131.5         −6.5            144         155            −11
1880    129.0     128.5         +0.5            149         153            − 4
1881    126.6     125.2         +1.4            151         151              0
1882    127.7     120.8         +6.9            155         149            + 6
1883    125.9     117.2         +8.7            155         148            + 7
1884    114.1     114.7         −0.6            151         148            + 3
1885    107.0     111.8         −4.8            145         149            − 4
1886    101.0     109.2         −8.2            142         149            − 7
1887     98.8     106.9         −8.1            144         149            − 5
1888    101.8     104.2         −2.4            144         149            − 5
1889    103.4     102.5         +0.9            150         149            + 1
1890    103.3     101.0         +2.3            155         149            + 6
1891    106.9      99.9         +7.0            156         150            + 6
1892    101.1      98.7         +2.4            154         151            + 3
1893     99.4      97.4         +2.0            147         153            − 6
1894     93.5      96.3         −2.8            150         155            − 5
1895     90.7      95.0         −4.3            150         156            − 6
1896     88.2      94.3         −6.1            157         156            + 1
1897     90.1      93.8         −3.7            160         157            + 3
1898     93.2      93.4         −0.2            162         158            + 4
1899     92.2      93.8         −1.6            165         159            + 6
1900    100.0      94.7         +5.3            160         159            + 1
1901     96.7      95.7         +1.0            159         159              0
1902     96.4      96.9         −0.5            159         158            + 1
1903     96.9      98.3         −1.4            157         158            − 1
1904     98.2      99.5         −1.3            153         156            − 3
1905     97.6     100.0         −2.4            153         155            − 2
1906    100.8     101.3         −0.5            157         154            + 3
1907    106.0     102.8         +3.2            159         153            + 6
1908    103.0     104.8         −1.8            151         153            − 2
1909    104.1      ..            ..             147          ..             ..
1910    108.8      ..            ..             150          ..             ..
1911    109.4      ..            ..             152          ..             ..
1912    114.9      ..            ..             155          ..             ..

It is comparatively easy from this graph to trace the change in prices from year to year and from decade to decade: for example, we note that from 1873 to 1896 the tendency of prices was on the whole downward, and from 1896 to 1910 the tendency was upward.
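Columns (3) and (4) of Table (20) are a nine-year average centred on each year and the difference of that year's figure from it. A minimal sketch of the calculation follows, using only the first thirteen price index numbers from col. (2); the function name `centred_average` is illustrative and not taken from the text.

```python
# Wholesale price index numbers, col. (2) of Table (20), 1871 to 1883.
prices = {
    1871: 135.6, 1872: 145.2, 1873: 151.9, 1874: 146.9, 1875: 140.4,
    1876: 137.1, 1877: 140.4, 1878: 131.1, 1879: 125.0, 1880: 129.0,
    1881: 126.6, 1882: 127.7, 1883: 125.9,
}


def centred_average(series, year, span=9):
    """Average of `span` consecutive yearly values centred on `year`."""
    half = span // 2
    window = [series[y] for y in range(year - half, year + half + 1)]
    return sum(window) / span


# The first average falls on 1875, since it needs the four years before it.
for year in range(1875, 1880):
    average = round(centred_average(prices, year), 1)
    difference = round(prices[year] - average, 1)
    print(f"{year}  price {prices[year]:6.1f}  average {average:6.1f}  "
          f"difference {difference:+.1f}")
```

The printed averages and differences can be compared line by line with cols. (3) and (4) of the table for 1875 to 1879. Four years are lost at each end of the series, which is why the table shows no averages for 1871 to 1874 or for 1909 to 1912.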
Also on the assumption (not necessarily valid) that prices have varied continuously, or at least consistently, during the intervals between the dates to which the records refer, it is possible to read off intermediate values from the graph: e.g. midway between 1883 and 1884 we get the figure 120 as the index number for prices.

On the same graph sheet we have also plotted the marriage rate from year to year during the same period. The numbers are given in col. (5) of Table (20). This rate varies from 142 to 176, a range of 34, and we have a range of 40 small squares at our disposal in plotting; a difference of 1 in the marriage rate has therefore been taken to correspond to one side of a square, and the marriage rates 140, 150, 160 ... are accordingly marked along the axis perpendicular to the same base line as before, which is used again to measure the passage of years, but the second graph is drawn below the line whereas the first was drawn above it. In this way we are able to compare the two graphs, namely, the one registering the change in prices and the one registering the change in marriage rate from year to year.

It is interesting to observe that the two seem to be not unconnected: they go up and down almost in the same time, and mountains and valleys in the one correspond roughly to mountains and valleys in the other; in other words, there is some kind of correlation or reciprocal relation between them. Now these mountains and valleys are largely the result of what may be called short-time fluctuations, and it is important to distinguish between these changes which are transient and the more permanent or long-time changes.

In order to get rid of the former, which sometimes conceal the latter, the following device has been adopted: noticing that the wave period, the length of time taken for each complete up-and-down motion, is one of about nine years, nine-yearly averages have been taken of the figures for wholesale prices right down col.
(2) of Table (20); thus 139.3 is the average of the index numbers from 1871 to 1879 inclusive, 138.6 is the average of the numbers from 1872 to 1880 inclusive, and so on, the results being recorded in col. (3). When the points corresponding to these numbers are plotted we get the broken line in fig. (8) passing through the body of the original graph of prices and indicating its general trend in the course of years as separated from the temporary fluctuations.

[Fig. (8). Graph showing variation in wholesale prices index numbers, 1870 to 1910.]

[Fig. (9). Graph showing variation in marriage rate index numbers, 1870 to 1910.]

The same procedure has been followed with the marriage rate statistics; the nine-yearly averages are shown in col. (6) of Table (20), and their graph appears as a broken line passing through the body of the original marriage rate graph in fig. (9).

[Figure: graph showing fluctuations of the index numbers of wholesale prices from their nine-yearly average.]

... $y_2$), the curve connecting them is a straight line, and the y corresponding to any other x is at once given geometrically, as fig. (12) shows, by

$$\frac{PM}{P_1M} = \frac{P_2M_2}{P_1M_2},$$

i.e.

$$\frac{y - y_1}{x - x_1} = \frac{y_2 - y_1}{x_2 - x_1},$$

or

$$y = y_1 + \frac{y_2 - y_1}{x_2 - x_1}(x - x_1),$$

the familiar proportional relation which is employed in this simple case.

[Fig. (12).]

Example. Given

log 5.82673 = 0.7654249,
log 5.82674 = 0.7654257.

Required log 5.826736.

Here
$x_1$ = 5.826730, $y_1$ = 0.7654249,
$x_2$ = 5.826740, $y_2$ = 0.7654257,
$x$ = 5.826736.
Therefore, by means of the above relation,

$$y = 0.7654249 + \frac{0.0000008}{0.000010}(0.000006) = 0.7654249 + 0.00000048 = 0.7654254.$$

The logarithmic curve $y = \log x$ is, of course, not a straight line, and the value obtained for y only represents a first approximation to the true value.

When more than two points are given there is bound to be a margin of inaccuracy, more or less according to the data, introduced in drawing the curve. For an example of this method the reader may refer back to the curve on p. 67, which was used to determine the median and quartiles. We may, as we saw, read off from it the number of candidates who obtained not more than any stated number of marks: e.g. 300 candidates obtained not more than 34 marks; or we may use it the other way round and find the number of marks obtained by a stated number of candidates: e.g. 10 per cent. of the candidates got less than 17 marks. Such examples might be multiplied endlessly, and the method will be found extremely useful when a high degree of accuracy is not looked for.

But greater confidence will be felt perhaps in such results, though the foundation for it may be no more secure in many cases, if we can translate them from geometrical to algebraical form, if we can find, that is to say, some formula, like the simple proportional relation already introduced above, which will give one y when others are known.

In order to make the argument as general as possible we shall speak of x and y as variables, and we shall think of the value of y as depending upon that of x in such a way that when x is given, y is known or it can be estimated * (in the sense that when the year is given the population is known or can be estimated). Suppose

$$y = c_0 + c_1 x + c_2 x^2 + \dots$$

[* This is equivalent to assuming that y is some function of x, say $y = f(x)$, and clearly some such assumption is necessary if any estimate from the known values to the unknown is to be possible.
Further, for simplicity we assume $f(x)$ can be expanded in a Maclaurin's converging series of ascending powers of x, which simply means that we take the relation between x and y to be of the form adopted above.]

where the c's are constants to be determined, and their number can be made to depend upon the number of known values of y which are used in the estimate.

Geometrically, the equation

$$y = c_0 + c_1 x + c_2 x^2 + \dots + c_n x^n$$

represents a curve called a parabola of the nth order, and such a curve could be employed (and uniquely found, for there is only one parabola of the kind which will go through all the points) if we based our estimate upon a knowledge of $(n+1)$ y's corresponding to given x's, for we could readily make it pass through the $(n+1)$ known points $(x_0, y_0), (x_1, y_1), (x_2, y_2), \dots (x_n, y_n)$ by choosing the $(n+1)$ c's so as to satisfy the $(n+1)$ simple linear relations:

$$y_0 = c_0 + c_1 x_0 + c_2 x_0^2 + \dots + c_n x_0^n$$
$$y_1 = c_0 + c_1 x_1 + c_2 x_1^2 + \dots + c_n x_1^n$$
$$y_2 = c_0 + c_1 x_2 + c_2 x_2^2 + \dots + c_n x_2^n$$
$$\dots$$
$$y_n = c_0 + c_1 x_n + c_2 x_n^2 + \dots + c_n x_n^n.$$

When the curve is determined, in other words when the c's are known, we can find any other y required by substituting the corresponding x in the equation

$$y = c_0 + c_1 x + c_2 x^2 + \dots + c_n x^n,$$

i.e. by supposing this point (x, y) to lie on the same curve that goes through the known points.

It is well to mention here that the parabola is by no means always the best curve for fitting any given statistics, and when the number of observations is adequate it is possible often to make a more satisfactory choice. Once the equation of a suitable curve has been determined the subsequent interpolation or calculation of y for any given x is not as a rule a very difficult matter. The larger question of curve fitting in general is reserved for a later chapter.

Example of First Method (fitting with a parabolic curve).
Let us illustrate this process of interpolation by fitting a parabolic curve to the following figures, extracted from Porter's The Progress of the Nation, giving the annual cost of Poor Relief (excluding insane and casual) at five-yearly intervals, but with the amount for the year 1845 omitted:

Year             1835   1840   1845   1850   1855
Cost in £1000    5526   4577     ?    5395   5890

Assuming that no extraordinary conditions prevailed in 1845 to cause abnormality in expenditure, let us estimate what the figure would be for that year judging from the given records just before and after.

Since there are four known points in this case, we take as the curve through them a parabola of the 3rd order, namely:

$$y = c_0 + c_1 x + c_2 x^2 + c_3 x^3; \quad (1)$$

the four known points will then just suffice to determine uniquely the four arbitrary constants $c_0, c_1, c_2, c_3$. Also, since the x class-intervals are equal, it will simplify the algebra if we measure from the year 1845 as origin, taking five years as unit for x and £1000 as unit for y, so that we get

x =   -2     -1      0     +1     +2
y = 5526   4577   $y_0$   5395   5890

where $y_0$ is the number to be determined.

Since all five points are to lie on the curve with equation as in (1), we have by substituting in that equation:

$$5526 = c_0 - 2c_1 + 4c_2 - 8c_3$$
$$4577 = c_0 - c_1 + c_2 - c_3$$
$$y_0 = c_0$$
$$5395 = c_0 + c_1 + c_2 + c_3$$
$$5890 = c_0 + 2c_1 + 4c_2 + 8c_3.$$

Adding the first and last of these equations,

$$2c_0 + 8c_2 = 5526 + 5890. \quad (2)$$

Adding the second and last but one,

$$2c_0 + 2c_2 = 4577 + 5395, \quad \text{or} \quad 8c_0 + 8c_2 = 4(4577 + 5395). \quad (3)$$

Subtracting (2) from (3),

$$6c_0 = 4(4577 + 5395) - (5526 + 5890) \quad (4)$$
$$= 4(9972) - (11416) = 39888 - 11416 = 28472.$$

Therefore $y_0 = c_0 = £4,745,000$.

If we only wish to make use of the records for the years 1840 and 1850, the appropriate fitting curve reduces to a straight line

$$y = c_0 + c_1 x,$$

on which we assume the points $(-1, 4577)$, $(0, y_0)$, $(+1, 5395)$ to lie, so that

$$4577 = c_0 - c_1$$
$$y_0 = c_0$$
$$5395 = c_0 + c_1.$$

Therefore, adding the first and last of these equations, $2c_0 = 4577 + 5395$, so that $y_0 = c_0 = £4,986,000$.
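The arithmetic of this example can be checked mechanically. The following Python sketch is not part of the original text: it assumes the same origin (1845) and units (five years, £1000) as above, and the helper name `solve_linear` is a hypothetical one introduced only for illustration. It solves the four linear relations given by the known points exactly and reads off $c_0$, then forms the straight-line estimate from 1840 and 1850 alone.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination, using exact fractions."""
    n = len(b)
    # Augmented matrix [a | b] with exact rational entries.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Bring a row with a non-zero entry in this column into pivot position.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        # Clear this column from every other row.
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


# Known points: x measured from 1845 in units of five years, y in £1000.
xs = [-2, -1, 1, 2]
ys = [5526, 4577, 5395, 5890]

# One relation y = c0 + c1*x + c2*x^2 + c3*x^3 for each known point.
relations = [[x**k for k in range(4)] for x in xs]
c = solve_linear(relations, ys)

y0_cubic = c[0]                       # value of the cubic at x = 0 (year 1845)
y0_line = Fraction(4577 + 5395, 2)    # straight line through 1840 and 1850

print("cubic estimate for 1845 (in £1000):", float(y0_cubic))
print("straight-line estimate (in £1000):", float(y0_line))
```

The cubic estimate agrees with equation (4), since $6c_0 = 28472$, and the straight-line estimate is simply the mean of the two neighbouring records.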
* Second Method (using a formula connecting the ordinates). When, as above, the steps from each x to the next are equal, as commonly happens in practice, it is possible to write down a simple relation between the y's, known and unknown, without introducing the c's at all. At bottom the method is the same as the last, inasmuch as the elimination of the c constants by the first method really results in the same formula for the unknown y.

[* The non-mathematical reader will do well to omit the rest of this section on interpolation.]

Let us represent the given statistics in this case by

x:  $x_0$,  $x_0 + h$,  $x_0 + 2h$,  . . .  $x_0 + nh$
y:  $y_0$,  $y_1$,  $y_2$,  . . .  $y_n$

so that, if the fitting curve be

$$y = c_0 + c_1 x + c_2 x^2 + \dots + c_n x^n,$$

we have, by substituting the co-ordinates of the first two points in this equation,

$$y_1 = c_0 + c_1(x_0 + h) + c_2(x_0 + h)^2 + \dots + c_n(x_0 + h)^n$$

and

$$y_0 = c_0 + c_1 x_0 + c_2 x_0^2 + \dots + c_n x_0^n.$$

Hence

$$y_1 - y_0 = c_1 h + c_2(2x_0 h + h^2) + \dots + c_n(n x_0^{n-1} h + \dots).$$

Now this result, which we call the 1st difference between the y's, is of $(n-1)$th degree in $x_0$, so that by subtracting two of the y's we have reduced the degree in $x_0$ by 1. Similarly,

$$y_2 - y_1 = c_1 h + c_2(2x_0 h + 3h^2) + \dots + c_n(n x_0^{n-1} h + \dots).$$

Thus we get a series of 1st differences, each with the highest term of the $(n-1)$th degree in $x_0$. Treating them as a series of new ordinates and forming their differences in the same way, we get what may be called the 2nd differences between the y's, a series of ordinates each with the highest term of degree $(n-2)$ in $x_0$. Proceeding in this way the 3rd differences between the y's are a series of ordinates of degree $(n-3)$ in $x_0$, the 4th differences are of degree $(n-4)$, and so on, until ultimately we reach the nth differences, which are of zero degree in $x_0$, and consequently involve only h. It follows that the nth differences must all be equal in value and therefore, if we go one step further and write down the $(n+1)$th differences, these must vanish altogether.
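The argument just given can be tested numerically on a parabola of the third order. The sketch below is an illustration only and not part of the original text: the cubic chosen and the function name `difference_table` are assumptions made for the purpose. It forms the successive differences of ordinates taken at equal steps and shows that the third differences are all equal while the fourth vanish.

```python
def difference_table(ys):
    """Return [ys, 1st differences, 2nd differences, ...] down to a single entry."""
    table = [list(ys)]
    while len(table[-1]) > 1:
        prev = table[-1]
        # Each new row holds the differences of neighbouring entries in the row above.
        table.append([b - a for a, b in zip(prev, prev[1:])])
    return table


def cubic(x):
    """An arbitrary parabola of the third order, chosen only for illustration."""
    return 2 * x**3 - x**2 + 4 * x - 7


# Ordinates at x = 0, 1, ..., 6, i.e. equal steps h = 1.
ordinates = [cubic(x) for x in range(7)]
table = difference_table(ordinates)

for order, row in enumerate(table[:5]):
    print("order", order, ":", row)
```

With step h = 1 the constant third difference is $3!\,c_3 h^3$, here 12, in agreement with the statement that the nth differences involve only h.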
If the reader finds any difficulty in following the argument he should test it step by step for himself in the simple case of a parabola of the third order, when it should be perfectly clear.

The formation of the successive differences is conveniently shown in Table (23).

Table (23). Successive Differences of Ordinates.

Ordinates (y): $y_0$, $y_1$, $y_2$, $y_3$, $y_4$, $y_5$.
First differences ($\Delta$): $y_1 - y_0$, $y_2 - y_1$, $y_3 - y_2$, $y_4 - y_3$, $y_5 - y_4$.
Second differences ($\Delta^2$): $y_2 - 2y_1 + y_0$, $y_3 - 2y_2 + y_1$, $y_4 - 2y_3 + y_2$, $y_5 - 2y_4 + y_3$.
Third differences ($\Delta^3$): $y_3 - 3y_2 + 3y_1 - y_0$, $y_4 - 3y_3 + 3y_2 - y_1$, $y_5 - 3y_4 + 3y_3 - y_2$.
Fourth differences ($\Delta^4$): $y_4 - 4y_3 + 6y_2 - 4y_1 + y_0$, $y_5 - 4y_4 + 6y_3 - 4y_2 + y_1$.
Fifth difference ($\Delta^5$): $y_5 - 5y_4 + 10y_3 - 10y_2 + 5y_1 - y_0$.

The law of formation should be apparent from this table, for it is precisely that which we meet in the binomial expansion, e.g. the nth difference is of type

$$y_n - n\,y_{n-1} + \frac{n(n-1)}{1 \cdot 2}\,y_{n-2} - \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3}\,y_{n-3} + \dots + (-1)^n y_0,$$

and by equating to zero the $(n+1)$th difference we have the relation required between the y's.

Example. Let us apply this method to the 'Poor Relief' example already considered.

Since there are four known points the relation between x and y must be of the form

$$y = c_0 + c_1 x + c_2 x^2 + c_3 x^3$$

as before. Hence the 4th differences must vanish, and taking the points in order from years 1835 to 1855 as $(x_0, y_0), (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)$, we get

$$y_4 - 4y_3 + 6y_2 - 4y_1 + y_0 = 0$$

as the formula connecting five y's, four known and one ($y_2$) unknown. Therefore

$$6y_2 = 4(y_1 + y_3) - (y_0 + y_4) = 4(4577 + 5395) - (5526 + 5890),$$

which is equivalent to equation (4) on p. 89. Thus $y_2 = £4,745,000$.

Third Method (by means of advancing differences). In the last method we employed a relation connecting $y_n$ with all the preceding y's, but it is possible also to express $y_n$ in terms of $y_0$ and the successive differences, which may be written $\Delta_0, \Delta_0^2, \Delta_0^3, \dots \Delta_0^n$; we have, in fact, with the notation of Table (23):

$$\Delta_0 = y_1 - y_0, \quad \Delta_0^2 = y_2 - 2y_1 + y_0, \quad \Delta_0^3 = y_3 - 3y_2 + 3y_1 - y_0, \ \dots$$
Thus

$$y_1 = y_0 + \Delta_0.$$
$$y_2 = 2y_1 - y_0 + \Delta_0^2 = y_0 + 2\Delta_0 + \Delta_0^2.$$
$$y_3 = 3y_2 - 3y_1 + y_0 + \Delta_0^3 = 3(y_0 + 2\Delta_0 + \Delta_0^2) - 3(y_0 + \Delta_0) + y_0 + \Delta_0^3 = y_0 + 3\Delta_0 + 3\Delta_0^2 + \Delta_0^3.$$
$$y_4 = 4y_3 - 6y_2 + 4y_1 - y_0 + \Delta_0^4 = 4(y_0 + 3\Delta_0 + 3\Delta_0^2 + \Delta_0^3) - 6(y_0 + 2\Delta_0 + \Delta_0^2) + 4(y_0 + \Delta_0) - y_0 + \Delta_0^4 = y_0 + 4\Delta_0 + 6\Delta_0^2 + 4\Delta_0^3 + \Delta_0^4.$$

Here again the law of formation is clear, and it is readily established by induction that, for all positive integral values of n,

$$y_n = y_0 + n\Delta_0 + \frac{n(n-1)}{1 \cdot 2}\Delta_0^2 + \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3}\Delta_0^3 + \dots \quad (5)$$

a series which automatically comes to an end at the term $\Delta_0^n$.

An extension of this formula is obtained by writing $\theta$ in place of n, where $0 < \theta < 1$. We then get

$$y_\theta = y_0 + \theta\Delta_0 - \frac{\theta(1-\theta)}{1 \cdot 2}\Delta_0^2 + \frac{\theta(1-\theta)(2-\theta)}{1 \cdot 2 \cdot 3}\Delta_0^3 - \dots \quad (6)$$

which enables us to interpolate for a y in between any two of a series of y's corresponding to x's advancing by equal steps. This relation is no longer identically true as was (5), for the series on the right in (6) is unending, but its application in practice is justified when, as the differences advance, the numbers obtained tend to grow smaller and smaller, so that the remainder after a certain number of terms can be treated as negligible. Unless this tendency is realized without carrying the differences far the formula is not very satisfactory.

To illustrate the method of procedure the following figures may be used from Table (7), p. 25:

Table (24). Marks obtained by certain Candidates in an Examination.

No. of Marks        No. of Candidates   First difference   Second difference   Third difference
                           y                  Δ                  Δ²                  Δ³
Not more than 45          447
                                               37
Not more than 50          484                                    -16
                                               21                                      1
Not more than 55          505                                    -15
                                                6                                     12
Not more than 60          511                                     -3
                                                3
Not more than 65          514

Suppose now we wish to know the number of candidates who obtained a number of marks not more than 48. In that case, in applying formula (6), we have

$$y_0 = 447, \quad \theta = (48 - 45)/(50 - 45) = 3/5, \quad \Delta_0 = 37, \quad \Delta_0^2 = -16, \quad \Delta_0^3 = 1,$$

and hence, up to this order of differences, the required number of candidates is given by

$$447 + \frac{3}{5} \cdot 37 - \frac{\frac{3}{5} \cdot \frac{2}{5}}{1 \cdot 2}(-16) + \frac{\frac{3}{5} \cdot \frac{2}{5} \cdot \frac{7}{5}}{1 \cdot 2 \cdot 3}(1)$$
$$= 447 + 22.2 + 1.92 + 0.06$$
$$= 471, \text{ approximately}.$$
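The working of this example may be checked with a short Python sketch, which is not part of the original text. The function names `leading_differences` and `advancing_difference_estimate` are hypothetical, and the coefficient of each difference is built up as $\theta(\theta-1)\dots(\theta-k+1)/k!$, which is the same as the alternating form written in (6).

```python
def leading_differences(ys):
    """Return y0 and the leading differences of each order: [y0, D0, D0^2, ...]."""
    leading = []
    row = list(ys)
    while row:
        leading.append(row[0])
        row = [b - a for a, b in zip(row, row[1:])]
    return leading


def advancing_difference_estimate(ys, theta, order):
    """Evaluate formulas (5)/(6) up to the difference of the given order."""
    leading = leading_differences(ys)
    total = 0.0
    coeff = 1.0  # theta(theta-1)...(theta-k+1)/k!, starting at k = 0
    for k in range(order + 1):
        total += coeff * leading[k]
        coeff *= (theta - k) / (k + 1)
    return total


# Table (24): candidates with not more than 45, 50, 55, 60, 65 marks.
candidates = [447, 484, 505, 511, 514]

theta = (48 - 45) / (50 - 45)  # 48 marks lies 3/5 of the way from 45 to 50
estimate = advancing_difference_estimate(candidates, theta, order=3)

print("estimated candidates with not more than 48 marks:", round(estimate, 3))
```

When `theta` is a whole number the series terminates and the tabulated value is recovered exactly, as in (5); with `theta = 2` and differences up to the fourth order the function returns 505.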
Also, number of candidates obtaining more than 48 marks, but not more than 50

= 484 - 471 = 13, approximately.

Fourth Method (by means of Lagrange's Formula). We shall consider one more formula, due to the famous French mathematician Lagrange (1736-1813), which is useful when the recorded y's correspond to x's which advance by unequal stages.

Let the given statistics be represented as before by

$$(x_0, y_0), (x_1, y_1), (x_2, y_2), \dots (x_n, y_n),$$

and consider the equation

$$y = y_0\frac{(x-x_1)(x-x_2)\dots(x-x_n)}{(x_0-x_1)(x_0-x_2)\dots(x_0-x_n)} + y_1\frac{(x-x_0)(x-x_2)\dots(x-x_n)}{(x_1-x_0)(x_1-x_2)\dots(x_1-x_n)} + \dots + y_n\frac{(x-x_0)(x-x_1)\dots(x-x_{n-1})}{(x_n-x_0)(x_n-x_1)\dots(x_n-x_{n-1})}. \quad (7)$$

It is of the nth degree in x, and it is identically satisfied by the $(n+1)$ pairs of values $(x = x_0, y = y_0)$, $(x = x_1, y = y_1)$, . . . $(x = x_n, y = y_n)$. It will therefore clearly serve as the fitting curve

$$y = c_0 + c_1 x + c_2 x^2 + \dots + c_n x^n,$$

being exactly of this type, and in order to get the y corresponding to any other x we have only to substitute that value of x in (7).

Example. The following figures, based upon data from Porter's The Progress of the Nation, show the age distribution of criminals in the year 1842.

Percentage of criminals up to age 25 = 52.0 ($y_0$).
Percentage of criminals up to age 30 = 67.3 ($y_1$).
Percentage of criminals up to age 40 = 84.1 ($y_2$).
Percentage of criminals up to age 50 = 92.4 ($y_3$).

Let us employ Lagrange's formula to find the approximate percentage of criminals up to 35 years of age, making use of the four ordinates given, and taking x = 35. We have

$$y = 52.0\,\frac{(35-30)(35-40)(35-50)}{(25-30)(25-40)(25-50)} + 67.3\,\frac{(35-25)(35-40)(35-50)}{(30-25)(30-40)(30-50)} + 84.1\,\frac{(35-25)(35-30)(35-50)}{(40-25)(40-30)(40-50)} + 92.4\,\frac{(35-25)(35-30)(35-40)}{(50-25)(50-30)(50-40)}$$
$$= -10.4 + 50.475 + 42.05 - 4.62 = 77.5.$$

[Fig. (13). Horizontal axis: number of cigarettes bought.]

Reasoning made Clear with the Help of Graphs or Curves.
The graphical method not only produces an instructive picture of a scheme of observations, but it may also be used effectively on occasion to pilot one through the intricacies of economic or similar argument. The eye is a very ready pupil and is quick to pass on what it sees to the mind; it acts, that is to say, as an ally to the understanding, which might get on without it, but which certainly gets on faster with it. To illustrate this we shall consider the first principles of an interesting class of curves relating to supply and demand.*

Curve of Demand. Conceive a smoker who buys cigarettes at the rate of x per day, and pays for them at the rate of y pence each. Altogether they cost him therefore a sum of xy pence per day, which is conveniently measured by the rectangle OABC in fig. (13). Notice that the cost price of each single cigarette is here represented by the area (y × 1), while the total expenditure is represented by the area (y × x).

Now let us suppose his country is at war and that the smoker, to put himself in a position to discourage luxuries, decides to give up smoking. Let us try to measure in terms of pence the cost of this great sacrifice to him on the first day.

[Fig. (14). Horizontal axis: number of cigarettes bought.]

The first cigarette is probably the hardest to do without, and the desire for it is so strong that, if it were a mere matter of money and not of patriotism, he would be willing to give as many pence as are represented, say, by the rectangle 1-1 in fig. (14) in order to have it to smoke. If he went on to bargain [* A fuller account of these curves will be found in Cunynghame's Geometrical Political Economy, where a rather more accurate interpretation of "surplus value" is given, involving the introduction of subordinate curves. The simplified statement here adopted seemed sufficient in an introductory course.
Marshall's Principles of Economics also contains many fascinating illustrations of the use of such curves, mainly in footnotes.] with himself in imagination, he would not be ready to offer quite so much for the satisfaction of a second smoke soon after the first: he would perhaps only give a number of pence represented by the rectangle 2-2 in the figure for this second cigarette. And if it came to a third he would offer less still, only '3-3' pence perhaps, for the fourth '4-4' pence, and so on. The rectangles here are of varying height, but each stands on a base of unit length.

Thus we find that the total sum he would be prepared to offer, bargaining for cigarette after cigarette in this way, would be represented by the sum of the rectangles 1-1, 2-2, 3-3 . . . in fig. (14), where the addition of each unit length along OX means one more cigarette in imagination smoked, and a diminution of unit length in an ordinate parallel to OY means a reduction of 1d. per cigarette in the price the smoker would be prepared to pay.

But if he fell a prey to his persistent craving and actually bought a number of cigarettes represented by OA in the figure, each would cost him in the ordinary way only a number of pence represented by AB, say, i.e. area (AB × 1), and his total expenditure would thus be measured by the area of the rectangle OABC. He would get them, that is to say, for less than he would be prepared to give rather than go without them. The difference, the area of the rectangles making up the portion BCDE of fig. (14), represents the measure in pence of surplus enjoyment which he would obtain free of charge, or it represents the measure of free sacrifice he makes if he is true to his patriotic principles.

Let us now take an example on a larger scale. Imagine a small community of people, producers and consumers, buying and selling among themselves.
Some of them are coalowners and sell coal to the others in the open market, where competition is supposed free and unrestricted in any way. This last condition is emphasized, because it is seldom perfectly satisfied in the real world of commerce.

Just as in the previous case we may represent the number of cwts. of coal bought by a length OA measured along OX in fig. (15), and the price actually paid in shillings per cwt. by the area of a rectangle on unit base and of height OC along OY. Thus the total cost to the consumers in shillings is measured by the area of the rectangle OABC.

[Fig. (15). Horizontal axis: number of cwts. of coal bought.]

But here again we may picture the consumers during a coal shortage, when, rather than go without the first cwt. of coal, some one among them would be ready to offer for it as many shillings as are represented by the rectangle 1-1 in fig. (15), and for the second cwt. some one would be ready to offer '2-2' shillings, for the third '3-3' shillings, and so on. The demand for coal could thus be measured in shillings by the sum of the rectangles 1-1, 2-2, 3-3 . . . and, if OA runs into thousands of units of coal, the lengths 0-1, 1-2, 2-3 . . . along OX, corresponding to additions of 1 cwt. in the quantity bought, would in the limit be so small that the sum of the rectangles would become practically equivalent to the curvilinear area OAED in the figure, where DE is a curve drawn through the summits of the rectangles, namely the curve of demand.

The consumers' surplus in this case would be measured in shillings by the area BCDE, this being the difference between the measures of the sum actually paid for the coal bought and the sum consumers would have been willing to pay rather than go without it.

Curve of Supply. Now let us consider the question from the point of view of the coalowners. We shall assume that the average cost of production per cwt. of coal increases steadily as the number of cwts.
produced increases; this would not be an unreasonable assumption in most cases after passing a certain point, since the richer coal measures known are likely to be mined before the poorer ones, and the cost of mining near the surface is bound to be less than when deep shafts have to be bored.

If, then, OA, fig. (16), represents the number of cwts. of coal sold, and if the price in shillings per cwt. at which it is sold is denoted by the area of a rectangle on unit base and of height OC along OY, the total payment received by the coalowners will be measured in shillings by the area of the rectangle OABC.

[Fig. (16). Horizontal axis: number of cwts. of coal sold.]

But the cost of producing the first cwt. is perhaps measured by the rectangle 1-1, that of producing the second cwt. by the rectangle 2-2, the third by the rectangle 3-3, and so on, each rectangle being drawn on unit base representing an advance of 1 cwt. (The advance in the cost of production would not in reality be measured by so much the cwt. of course, but the assumption is inaccurate in degree only, not in principle, and, by making it, the argument is rendered clearer.) Thus the actual cost of production is, in the limit when OA is very large and divided up into relatively very small parts, measured in shillings by the curvilinear area OAED, where DE is a curve drawn through the summits of the rectangles, namely, the curve of supply.

The difference, BCDE, between the areas OABC and OAED represents what is known as producers' surplus, for it measures the profit made by the owners in selling the coal at a higher price than the cost price of production.

Now let us combine the curve of supply (S.C.) and the curve of demand (D.C.) in the same figure, fig. (17). Their meeting point P determines the number of cwts. of coal bought (x), and the selling price in shillings per cwt. (y).
For it is clear that under normal conditions it would not be profitable to coal producers to pass this point, because beyond it the demand on the part of coal consumers measured in money is less than the cost of production: they are not willing on the average to pay so much as y shillings per cwt. for it, and it costs more than y shillings per cwt. on the average to produce.

[Fig. (17). Supply curve (S.C.) and demand curve (D.C.); horizontal axis: number of cwts. of coal bought or sold.]

If, on the other hand, the amount of coal produced decreases below x cwts., the greater this decrease the higher does the profit become on the sale of it, because the greater is the difference between the cost price and the selling price; hence, as profits become more pronounced, recruits will be attracted into the coal-producing business, and, if this goes on, deeper shafts will have to be bored and poorer fields worked until profits begin to decrease again and the supply once more approaches x cwts. Thus sooner or later the production of coal and its market price will tend to the level determined by the equilibrium point P where the supply and demand curves meet.

Endless varieties of problems may be discussed by altering the conditions and observing the effect produced in the standard diagram. Three examples will suffice to illustrate the method.

1. Effect of a Change in Normal Demand.
Here we suppose the normal conditions of supply are unaltered (it costs just as much as before to produce the same amount of the commodity in question); but a more eager demand on the part of consumers shows itself in a readiness to purchase more at any given price than would have been purchased under the old conditions: this may conceivably be due to a general increase in the purchasing power of these consumers, or it may be the result of a shortage of some other commodity which causes this one to be more widely used, just as margarine, for instance, has been known to take the place of butter; whatever the reason may be, the effect is that the demand curve now occupies a higher level throughout its length, D'.C'. in place of D.C. in the figures.

[Figure: Decreasing Return.]

When we turn to the supply side of the question, there are three stages which, although they shade into one another in practice, it is well to separate clearly in theory: (1) the only supplies immediately available are those actually in the hands of dealers; (2) to meet the increased demand, and so earn for themselves increased profits, manufacturers will speed up production, by working overtime, etc., with the help possibly of any disengaged labour or capital they may be able to secure, and the resulting extra supplies will be available after a short time; (3) if the demand continues unabated manufacturers, by offering higher wages and interest, will seek to attract fresh labour and capital from other engagements into their business, and, by renewing their machinery and generally improving their organization, they will produce on a larger and relatively more economical scale. Moreover, other manufacturers, seeing the profits to be earned, will be attracted into the same line of business also, so that by this time the current available supplies of the commodity may exceed very appreciably their old figure.
But all this happens only in the long run, and the economist has always to bear this extremely important element of time carefully in mind when he seeks to estimate the effects of any proposed action. We assume then that the new demand remains long enough at its higher level to allow for the gradual adjustment in this way of supply to the changed conditions, and for the economic forces called into play once again to arrive at a balance between them, most likely at a new equilibrium point.

[Figure: Increasing Return.]

The two figures illustrate the difference in effect according as the production of the commodity is subject to a decreasing or an increasing return, i.e. according as the cost of production rises or falls when the amount produced is increased. In both cases it will be noted that more of the commodity is produced (ON' in place of ON) in answer to the keener demand, but the difference is much greater in the second case than in the first. Also the price has gone up in the first case, while in the second it has gone down, the difference being measured by the change in PN.

2. Effect of a Tax. If the tax is at the rate of so much per unit (say 1s. per unit, if the price is measured in shillings) of the commodity produced, this will raise the supply curve, S.C., bodily up a distance of 1 unit into the position S'.C'., fig. (18), because the effect is the same as if 1s. were added to the cost of each unit in production. The production will thus be diminished by N'N units, for P' is the new equilibrium point; the selling price will be increased by P'M s. per unit (by less, it should be noted, than P'Q or K'K, the amount of the tax); producers' surplus, which is analogous to what economists term rent, is diminished by (area KPL - area K'P'L') s.; consumers' surplus is diminished by (area PLL'P') s.; finally, the tax produces for the Treasury a number of shillings represented by a rectangle with sides of length ON' and KK'.

3.
Effect of a Monopoly. A monopolist has the power to stop production short of the true equilibrium point, so that ON' cwts., fig. (19), are produced in place of the ON cwts. which free competition would demand. The selling price is thus raised by Q'S s. per cwt.; producers' surplus is increased by (area KP'Q'M' - area KPL) s.; while consumers' surplus is diminished by (area PLD - area DM'Q') s.

A word of explanation is necessary before leaving the subject of these supply and demand curves. It is probable that the reader will have questioned the possibility of drawing such curves for any commodity with sufficient accuracy to be of any value, but it would be enough as a rule to be able to estimate what would happen if a slight variation occurred in price or in production, and such an estimate may sometimes be made by actual trial: e.g. a good practical farmer most likely knows nothing about supply and demand curves as such, yet from past experience he has a pretty shrewd notion as to how far it may be profitable to spend an extra pound here in rearing calves and a pound less there in cultivating crops, bearing in mind the prices which cattle and corn might be expected to fetch. From his point of view the interest of the curves, if he knew anything of them, would be centred in those portions which correspond to normal conditions, i.e. somewhere in the neighbourhood of the equilibrium point under the free play of ordinary competition.

Their real value, however, as suggested at the beginning, does not consist in the practical assistance which they afford to the producer or consumer, by way of foretelling the actual measure of consumption or production, so much as in the light they throw upon general tendencies which are rather apt to be obscured if they are ponderously presented with elaborate economic argument. They make plain in a moment to the eye what can only be stated in two or three pages of writing.
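The effect of a tax described in this chapter can be followed numerically if, purely for illustration, the supply and demand curves are taken to be straight lines. The Python sketch below is not part of the original text: the intercepts, slopes, and rate of tax are hypothetical figures, and the straight-line form is itself an assumption, chosen only because the areas then reduce to triangles. It finds the equilibrium point before and after the tax and compares the surpluses.

```python
from fractions import Fraction

# Hypothetical straight-line curves (an assumption made for illustration only):
#   demand price = a - b*q   (falls as the quantity q rises)
#   supply price = c + d*q   (rises with q, i.e. a decreasing return)
a, b = Fraction(20), Fraction(1)
c, d = Fraction(5), Fraction(2)
tax = Fraction(3)  # shillings per unit, also hypothetical


def equilibrium(supply_intercept):
    """Quantity and price at which the demand and supply lines meet."""
    q = (a - supply_intercept) / (b + d)
    return q, a - b * q


def consumers_surplus(q):
    """Triangle between the demand line and the price line, up to quantity q."""
    return b * q * q / 2


def producers_surplus(q):
    """Triangle between the price line (net of any tax) and the supply line."""
    return d * q * q / 2


q_old, p_old = equilibrium(c)
q_new, p_new = equilibrium(c + tax)  # the tax lifts the supply line bodily

price_rise = p_new - p_old
revenue = tax * q_new

print("quantity:", q_old, "->", q_new)
print("price:", p_old, "->", p_new, "(rise", price_rise, "against a tax of", tax, ")")
print("consumers' surplus lost:", consumers_surplus(q_old) - consumers_surplus(q_new))
print("producers' surplus lost:", producers_surplus(q_old) - producers_surplus(q_new))
print("yield of the tax:", revenue)
```

With these figures the selling price rises by less than the tax, production falls, and both surpluses are diminished, which is the behaviour the diagram is meant to exhibit; other hypothetical lines would give other magnitudes.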
CHAPTER X

CORRELATION

One of the most important questions which can be discussed by statistical methods is that of possible connection, or correlation, as it is called, between two sets of phenomena. If some factor in each can be isolated and measured numerically, our object is to discover if the size of either is sympathetically affected when a change occurs in the size of the other; or, to put the matter in another way, do large values of the one factor go with large values of the other factor and small with small, or vice versa? And, if some mutual dependence of this kind exists, can an estimate of its extent be made?

Consider, for example, the factor or character of height in husband and wife. Is there any connection between stature of husband (x) and stature of wife (y)? Do tall men tend on the average to wed tall women, or do we find tall men choosing short women for wives just about as often as they choose tall women? When correlation exists we shall want some measure for it which will tell us the amount of change or deviation from the average in either character associated with a given change or deviation from the average in the other.

In studying graphs we saw how some hint of the existence of correlation might be discovered, but we wish now to go a little more deeply into the subject. The first step is to measure an adequate number of pairs of values, x and y, of the characters concerned in order to find what values are associated together, and how frequently the same values are repeated. When this is done we can draw up a table of double entry, see fig. (20), setting out in rows and columns the frequencies observed.

[Fig. (20).]

An examination of Table (25), showing the variation of brain weight with age in the case of 197 Bohemian women, will make clear what is meant.
The x's, from $x_1$ upwards, and the y's, from $y_1$ upwards, are supposed to ascend in magnitude, and when, for example, the pair of values $(x_2, y_3)$ is observed to be repeated nine times, the number 9 is placed in the second column and third row of the table, so that the frequency of each class is found recorded in the square proper to it: thus, out of the sample in Table (25), there are 10 women between the ages of 40 and 50 with brains weighing between 1300 and 1400 grams.

Table (25). Variation of Brain Weight with Age in the Case of certain Bohemian Women.

[Data from Biometrika, vol. iv. pp. 13 et seq., Variation and Correlation in Brain Weight, by Raymond Pearl.]

Brain weight                      Age in years (x)
in grams (y)        20-30   30-40   40-50   50-60   60-70   70-80   Totals
                    (x_1)   (x_2)   (x_3)
1000-1100              1       -       1       1       -       -        3
1100-1200              2       2       4       2       5       4       19
1200-1300 (y_3)       28       9       8      14      10       4       73
1300-1400             26      14      10       6       5       4       65
1400-1500             13       7       7       2       -       2       31
1500-1600              2       3       -       1       -       -        6
Totals                72      35      30      26      20      14      197
Mean y              1325    1350    1310    1285    1250    1279

When each class interval, as in this table, includes a small range of values, the x and y may, as an approximation, be taken as the mid values of their class intervals: $y_3$ would be taken, for instance, as 1250, though it really includes all values between 1200 and 1300 grams. Strictly in such cases each single observation is not, geometrically speaking, located at a definite point, but lies somewhere within a small area, though it is treated as if it had the values x and y which apply to the centre point of the area. It is sometimes possible to correct for this assumption by what is known as Sheppard's adjustment, but we shall not concern ourselves with the correction in the present discussion, so as to avoid complications, because the difference made is not generally large.

The table, when drawn up, may immediately suggest some intimate connection between x and y. It may indicate that as x increases y also in general increases, or that y tends to fall in value as x grows bigger.
But a more refined analysis is necessary. It would be instructive perhaps to travel along the row of x's, finding what mean value of y is associated with $x_1$, what mean value of y is associated with $x_2$, and so on. This would give a sounder basis for judging whether, as x increased, y in general increased or decreased as the case might be: for example, in Table (25) the mean values of y associated with the several types of x are shown in their proper columns at the foot of the table and clearly, as x increases, y tends to decrease, apart from conflicting readings at the beginning and end of the table, and the latter of these may not be significant of any real difference in brain weight at the end of life, for it is only based on fourteen observations; generally, the inference from this table would be that the weight of the brain decreases as the age increases after maturity is once reached, although, of course, it would be rash to make more than a tentative statement with so small a sample at our disposal.

Let us suppose $\bar{y}_1$ to be the mean value of y associated with $x_1$, $\bar{y}_2$ the mean value of y associated with $x_2$, $\bar{y}_3$ with $x_3$, and so on. If these values $(x_1, \bar{y}_1)$, $(x_2, \bar{y}_2)$, $(x_3, \bar{y}_3)$, etc., are plotted, it is very often found that they cluster more or less closely about a straight line, see fig. (21), so that we are led to ask whether there is not some line which will very fairly describe the run of the points; the equation of such a line would be $y = mx + c$, and if m and c were known we could find from this equation the best average value of y corresponding to any given x. But, on reflection, $\bar{y}_1, \bar{y}_2, \bar{y}_3, \dots$ are themselves only the best y's corresponding to the particular values $x_1, x_2, x_3, \dots$ of x, so that the problem is really the same as that of finding the relation $y = mx + c$, based on all the observations, which will enable us to estimate the best y corresponding to any given x.
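The column means quoted at the foot of Table (25) can be checked with a short calculation. The following is only a sketch: it assumes, as the text suggests, that every brain in a class may be given the mid-value of its class interval, and the variable and function names are ours, not the book's.

```python
# Frequencies of Table (25): one row per brain-weight class,
# one column per age class (20-30, 30-40, ..., 70-80).
age_classes = ["20-30", "30-40", "40-50", "50-60", "60-70", "70-80"]
weight_mids = [1050, 1150, 1250, 1350, 1450, 1550]  # assumed class mid-values, grams
freq = [
    [1, 0, 1, 1, 0, 0],
    [2, 2, 4, 2, 5, 4],
    [28, 9, 8, 14, 10, 4],
    [26, 14, 10, 6, 5, 4],
    [13, 7, 7, 2, 0, 2],
    [2, 3, 0, 1, 0, 0],
]


def column_means(mids, table):
    """Mean y of each column (array of y's) of a table of double entry."""
    means = []
    for j in range(len(table[0])):
        total = sum(row[j] for row in table)               # frequency of the array
        weighted = sum(m * row[j] for m, row in zip(mids, table))
        means.append(weighted / total)
    return means


mean_y = column_means(weight_mids, freq)
for age, m in zip(age_classes, mean_y):
    print(age, round(m))
```

Rounded to the nearest gram, the six values agree with the "Mean y" row of the table.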
Now for any value $x_1$ of x the value of y given by this relation is $(mx_1 + c)$, while by observation we may find more than one value of y corresponding to the value $x_1$ of x. If $y_1$ be one such value the difference between it and the value given by the above relation is $(mx_1 + c) - y_1$. This difference we may regard as the error made in estimating y from the relation instead of taking the value given by observation, which for the moment we think of as the true value.

[Fig. (21): the mean brain weights associated with ages of various types, plotted against age in years.]

The best relation will then clearly be the one which makes all such errors of estimate as small as possible. But, algebraically, some of these errors are positive, i.e. the value of y given by the relation is greater than that given by observation, and some are negative, and it is only their magnitudes that we wish to take into account. Accordingly we follow the method used in finding the standard deviation in order to get rid of the ambiguities of sign: we form, that is to say, the sum of the squares of the errors, because the expression so formed will clearly be least when each separate error is as small as possible in absolute magnitude.

We have, then, to find the values of m and c which will make

$$(mx_1 + c - y_1)^2 + (mx_2 + c - y_2)^2 + \dots + (mx_n + c - y_n)^2$$

a minimum (see Note 7 in the Appendix), where n is the total number of pairs of observations. The required values are given by differentiating, first with regard to c treating m as constant, and then with regard to m treating c as constant, putting each result equal to zero. Thus

$$(mx_1 + c - y_1) + \dots + (mx_n + c - y_n) = 0,$$
$$(mx_1 + c - y_1)x_1 + \dots + (mx_n + c - y_n)x_n = 0.$$

Therefore

$$m(x_1 + \dots + x_n) + nc - (y_1 + \dots + y_n) = 0,$$
$$m(x_1^2 + \dots + x_n^2) + c(x_1 + \dots + x_n) - (x_1y_1 + \dots + x_ny_n) = 0.$$

The first of these equations gives $m(n\bar{x}) + nc - (n\bar{y}) = 0$, i.e.

$$m\bar{x} + c - \bar{y} = 0,$$

where $\bar{x}$ is the mean of all the x's and $\bar{y}$ is the mean of all the y's, and it expresses the fact that the line $y = mx + c$ passes through the point $(\bar{x}, \bar{y})$. This might have been expected, for, graphically, each pair of observations $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, ... corresponds to some point, and if we look for the line $y = mx + c$ passing through the region where they cluster most thickly together we should certainly expect it to pass through their mean or centre of gravity $(\bar{x}, \bar{y})$.

This suggests how the values of m and c may be considerably simplified. If we measure all the x's from $\bar{x}$, their mean, and all the y's from $\bar{y}$, their mean, which is equivalent to taking the point $(\bar{x}, \bar{y})$ as origin and replacing every x by its deviation $\xi$ from $\bar{x}$ and every y by its deviation $\eta$ from $\bar{y}$, the first of the above relations is reduced to

$$c = 0,$$

and therefore the second becomes

$$m(\xi_1^2 + \dots + \xi_n^2) - (\xi_1\eta_1 + \dots + \xi_n\eta_n) = 0.$$

Hence

$$m = \frac{\xi_1\eta_1 + \dots + \xi_n\eta_n}{\xi_1^2 + \dots + \xi_n^2} = \frac{np}{n\sigma_x^2} = \frac{p}{\sigma_x^2},$$

where p is the mean of all the product pairs $\xi\eta$, and $\sigma_x$ is the standard deviation of all the x's.

Thus the required equation for estimating the best $\eta$ corresponding to any particular $\xi$ is

$$\eta = \frac{p}{\sigma_x^2}\,\xi,$$

whence

$$y - \bar{y} = \frac{p}{\sigma_x^2}(x - \bar{x}) \qquad (1)$$

The coefficient $p/\sigma_x^2$ in this equation evidently gives the deviation in y from the mean $\bar{y}$ corresponding to unit deviation in x from the mean $\bar{x}$, for when $(x - \bar{x}) = 1$, $(y - \bar{y}) = p/\sigma_x^2$. Hence the greater this coefficient is, the greater will be the change in y resulting from, or at all events coexistent with, unit change in x. Thus $p/\sigma_x^2$ would seem to supply a not unreasonable measure of the correlation between x and y.

But there is something very unsymmetrical about this result. Why should the correlation be measured by $p/\sigma_x^2$ any more than by $p/\sigma_y^2$? In fact, we might repeat the whole of the previous argument, interchanging x and y throughout wherever they appear.
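The result that m equals the mean product of deviations divided by the square of the standard deviation of the x's, with the line passing through the mean point, can be tried on a few numbers. The sketch below is illustrative only: the four pairs of values are invented for the purpose and the function name is ours. It also evaluates the left-hand sides of the two equations obtained by differentiation, which should both vanish.

```python
def fit_line(xs, ys):
    """Return (m, c) for the line y = m*x + c making the sum of squared errors least."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # p: mean of the products of deviations; var_x: square of the S.D. of the x's
    p = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n
    m = p / var_x
    c = y_bar - m * x_bar  # the line passes through (x_bar, y_bar)
    return m, c


# Invented illustrative data, not taken from the book.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
m, c = fit_line(xs, ys)

# Left-hand sides of the two minimising conditions.
eq1 = sum(m * x + c - y for x, y in zip(xs, ys))
eq2 = sum((m * x + c - y) * x for x, y in zip(xs, ys))
print(m, c, eq1, eq2)
```

Both conditions are satisfied to within rounding error, which is a useful check whenever the formula is coded by hand.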
In that case we should first travel down the column of y's and calculate the mean values of x associated with $y_1, y_2, y_3, \dots$ respectively. This would give a set of points $(\bar{x}_1, y_1)$, $(\bar{x}_2, y_2)$, $(\bar{x}_3, y_3)$, ..., which, when plotted, would perhaps lie approximately in a straight line. We should thus be led to look for some relation $x = m'y + c'$ which would enable us to estimate the best average x corresponding to a y of given type, and, proceeding just as before, we should ultimately obtain the equation

$$\xi = \frac{p}{\sigma_y^2}\,\eta,$$

or

$$(x - \bar{x}) = \frac{p}{\sigma_y^2}(y - \bar{y}), \qquad (2)$$

in which the coefficient $p/\sigma_y^2$ gives now the deviation in x from the mean $\bar{x}$ corresponding to unit deviation in y from the mean $\bar{y}$.

Hence $p/\sigma_y^2$ has, seemingly, just as much claim as $p/\sigma_x^2$ to measure the correlation between x and y. The one gives the change in x corresponding to unit change in y: the other gives the change in y corresponding to unit change in x; and the only reason why they differ is because unit change in x does not mean the same thing as unit change in y: their standards of changeableness or variability are not equal. If then we could alter the scales of measurement so that unit change in each were of the same magnitude, the two coefficients obtained ought to become identical, and we should then have a really satisfactory measure for the correlation required.

With this object let us examine the variability of the x's and compare it with the variability of the y's. Now the total dispersion of the different x's on either side of $\bar{x}$, the mean x, is conveniently measured by $\sigma_x$, their standard deviation. And similarly the dispersion of the y's on either side of $\bar{y}$, the mean y, is measured by $\sigma_y$. The bigger $\sigma_x$ is, the greater is the variability of the x's, and the bigger $\sigma_y$ is, the greater is the variability of the y's. Hence, in equations (1) and (2), $(x - \bar{x})$ should be divided by $\sigma_x$ and $(y - \bar{y})$ by $\sigma_y$ if we want to work with the same unit of change or variability in each case.
The equations then become

$$\frac{y - \bar{y}}{\sigma_y} = \frac{p}{\sigma_x\sigma_y}\cdot\frac{x - \bar{x}}{\sigma_x} \quad\text{and}\quad \frac{x - \bar{x}}{\sigma_x} = \frac{p}{\sigma_x\sigma_y}\cdot\frac{y - \bar{y}}{\sigma_y}.$$

Write $r = p/\sigma_x\sigma_y$; then r is taken to be the coefficient of correlation, for it measures the change in either character corresponding to unit change in the other when the units are made comparable.

The lines giving the best y for a given x and the best x for a given y may now be written

$$y - \bar{y} = r\frac{\sigma_y}{\sigma_x}(x - \bar{x}) \quad\text{and}\quad x - \bar{x} = r\frac{\sigma_x}{\sigma_y}(y - \bar{y}),$$

and they are called lines of regression.

The term regression was first used by Sir Francis Galton in a paper entitled Regression towards Mediocrity in Hereditary Stature, though the root idea is not by any means confined to characters affected by heredity: it holds for any pair of correlated variables. Galton found that if a number of tall fathers are selected and their heights measured, the mean height being calculated, and if, further, the heights of the sons of these fathers are measured, their mean height being likewise calculated, the latter is not equal to the mean height of the selected fathers, but is rather nearer the mean height of the population as a whole. There is, that is to say, a regression or stepping back of the variable towards the general average. Professor Karl Pearson has remarked that 'in the existing state of our knowledge the recognition that the true method of approaching the problem of heredity is from the statistical side, and that the most we can hope at present to do is to give the probable character of the offspring of a given ancestry, is one of the great services of Francis Galton to Biometry.'

The expressions $r\sigma_y/\sigma_x$ and $r\sigma_x/\sigma_y$ are called coefficients of regression, and they register in the above particular case the amount of abnormality to be expected in the height of the sons when the amount of abnormality in the height of the fathers is known, and vice versa.
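A small numerical illustration may help to show why r is symmetrical in x and y while the two coefficients of regression are not. The sketch below uses invented data and names of our own choosing; it computes r and both coefficients from their definitions and confirms that the product of the two coefficients of regression equals the square of r.

```python
from math import sqrt


def correlation_summary(xs, ys):
    """Return r and the two coefficients of regression for paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    p = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n  # mean product
    sd_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    r = p / (sd_x * sd_y)
    return {
        "r": r,
        "y_on_x": r * sd_y / sd_x,  # change in y per unit change in x
        "x_on_y": r * sd_x / sd_y,  # change in x per unit change in y
    }


# Invented illustrative data, not taken from the book.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
s = correlation_summary(xs, ys)
swapped = correlation_summary(ys, xs)
print(s)
```

Interchanging the two variables leaves r unaltered and merely exchanges the two coefficients of regression.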
The regression of the sons' height, y, on the fathers' height, x, is, in fact, defined as the ratio of the average deviation of the heights of the sons from the mean height of all sons to the deviation of the heights of the fathers from the mean height of all fathers, and hence it may be written $r\sigma_y/\sigma_x$.

To make the definition more general, instead of speaking merely in terms of height, we refer to any row or column (for there is no intrinsic difference between row and column) in a table like Table (25) as an array of y's or of x's, and selecting a particular type, say a particular value of x (like fathers of height x), we define the regression of the corresponding array of y's (like heights of sons of these fathers) on the type x to be the ratio of the average deviation of the array of y's from the mean y to the deviation of the selected type x from the mean x.

Example. To illustrate, let us take some figures due to Professor Pearson and Dr. Alice Lee [Biometrika, vol. ii. pp. 357 et seq., On the Laws of Inheritance in Man]. Suppose the mean stature of all observed fathers, based on a sample of over 1000 observations = 67.68 in., with S.D. = 2.70 in. Also suppose the mean stature of all sons = 68.65 in., with S.D. = 2.71 in., and that the correlation r between stature of father and stature of son = 0.514.

The regression of son on father as regards stature is then given by

$$(y - 68.65) = (0.514)\frac{2.71}{2.70}(x - 67.68),$$

where x is the height of selected fathers and y the mean height of their sons. Hence

$$y = 0.516x + 33.73,$$

so that if we selected fathers of height 70 in., for example, the mean height of their sons would not be 70 in., but

$$(0.516)(70) + 33.73 = 69.85 \text{ in.},$$

i.e. there is a regression towards the general mean, 68.65 in., of all sons.

Also the coefficient of regression = (0.514)(2.71)/(2.70) = 0.516.
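The arithmetic of this example is easily repeated. The sketch below simply substitutes the figures quoted from Pearson and Lee into the regression line of y on x; the names are our own.

```python
# Figures quoted in the example (inches).
mean_father, sd_father = 67.68, 2.70
mean_son, sd_son = 68.65, 2.71
r = 0.514

# Coefficient of regression of son on father, and the constant term of the line.
slope = r * sd_son / sd_father
intercept = mean_son - slope * mean_father


def expected_son_height(father_height):
    """Mean height of sons of fathers of the given height, by the regression line."""
    return intercept + slope * father_height


print(round(slope, 3), round(intercept, 2), round(expected_son_height(70), 2))
```

The mean height predicted for sons of 70 in. fathers lies between the general mean of sons and 70 in., which is the stepping back that gives regression its name.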
It is not difficult to show that the greatest numerical value r can in general take is unity, for consider the expression for the sum of the squares of the differences between the observed deviations of the y characters from their mean and the corresponding deviations as deduced from the best fitting regression line,

$$y - \bar{y} = r\frac{\sigma_y}{\sigma_x}(x - \bar{x}).$$

If, with our previous notation, $\eta$ denote the observed deviation of the one character y, associated with a particular deviation, $\xi$, of the other character, x, then, since $(r\sigma_y/\sigma_x)\xi$ denotes the best value given by the line, the sum of the squares of the differences between these values

$$= (\eta_1^2 + \dots + \eta_n^2) - 2r\frac{\sigma_y}{\sigma_x}(\xi_1\eta_1 + \dots + \xi_n\eta_n) + r^2\frac{\sigma_y^2}{\sigma_x^2}(\xi_1^2 + \dots + \xi_n^2)$$
$$= n\sigma_y^2 - 2r\frac{\sigma_y}{\sigma_x}(nr\sigma_x\sigma_y) + r^2\frac{\sigma_y^2}{\sigma_x^2}(n\sigma_x^2)$$
$$= n\sigma_y^2(1 - r^2).$$

Since the sum of a number of squared quantities cannot be negative, it follows that $r^2$ cannot exceed 1, and hence r lies between -1 and +1.

Further, ... [Fig. (23)] ... from which we infer that the corrected value of r is

$$r = \frac{\Sigma(xy) - n\bar{x}\bar{y}}{n\sigma_x\sigma_y}.$$

We proceed to a few applications of these results in the next chapter.

[As early as 1846 a French physicist, Auguste Bravais, had conceived the surface of error as a means of describing in space the path of a point whose x and y co-ordinates are subject to errors which are not independent; but it appears to be doubtful whether he saw the connection between his work and the subject of correlation. It was Galton, nearly forty years later, who really created that subject, introducing the coefficient of correlation on graphical lines and giving practical examples of its use. (See Biometrika, vol. xiii., pp. 25-45, Notes on the History of Correlation.) Edgeworth, in 1892, using Galton's function, independently reached some of Bravais' results related to the correlation of three variables, and showed how they could be extended.
Karl Pearson, in 1896, contributed to the Royal Society Transactions a fundamental paper on the subject, with special reference to the problem of heredity, drawing attention to the best value of the correlation coefficient, and how it should be calculated. (See Appendix, Note 11.) Yule, returning in the following year to Bravais' formulae, showed their significance also in the case of skew correlation. Pearson afterwards developed a method of determining the correlation of characters not quantitatively measurable, and in a discussion of the general theory of skew correlation in another paper he proposed a new function, the correlation ratio, applicable to the case of non-linear regression.]

CHAPTER XI

CORRELATION EXAMPLES

Example (1). To find the correlation between Differences in Wholesale Price Index Numbers and in the Marriage Rate from their corresponding Nine-yearly Averages during the twenty years, 1889-1908, using the data given on p. 77.

Table (26). Correlation between Differences in Wholesale Prices and Marriage Rate from their respective Nine-yearly Averages.

Col. (1): Year.
Col. (2): Difference in Prices from 9-yearly Average (x).
Col. (3): Square of No. in Col. (2) (x^2).
Col. (4): Difference in Marriage-rate from 9-yearly Average (y).
Col. (5): Square of No. in Col. (4) (y^2).
Col. (6): Product of Nos. in Col. (2) and Col. (4) (xy).

(1)       (2)       (3)      (4)     (5)      (6)
Year       x        x^2       y      y^2      xy
1889     + 0.9      0.81     + 1      1     +  0.9
1890     + 2.3      5.29     + 6     36     + 13.8
1891     + 7.0     49.00     + 6     36     + 42.0
1892     + 2.4      5.76     + 3      9     +  7.2
1893     + 2.0      4.00     - 6     36     - 12.0
1894     - 2.8      7.84     - 5     25     + 14.0
1895     - 4.3     18.49     - 6     36     + 25.8
1896     - 6.1     37.21     + 1      1     -  6.1
1897     - 3.7     13.69     + 3      9     - 11.1
1898     - 0.2      0.04     + 4     16     -  0.8
1899     - 1.6      2.56     + 6     36     -  9.6
1900     + 5.3     28.09     + 1      1     +  5.3
1901     + 1.0      1.00      ..     ..        ..
1902     - 0.5      0.25     + 1      1     -  0.5
1903     - 1.4      1.96     - 1      1     +  1.4
1904     - 1.3      1.69     - 3      9     +  3.9
1905     - 2.4      5.76     - 2      4     +  4.8
1906     - 0.5      0.25     + 3      9     -  1.5
1907     + 3.2     10.24     + 6     36     + 19.2
1908     - 1.8      3.24     - 2      4     +  3.6
Totals   + 24.1    197.17    + 41    306    + 141.9
         - 26.6              - 25           -  41.6

The arithmetic is comparatively simple in this case because there is only one value of each variable corresponding to each year, so that there is no weighting or grouping to complicate the analysis.

The variables x and y, between which we wish to find the correlation, appear in col. (2) and col. (4) in Table (26), and the positive and negative differences are separated from one another in each case so as to make their summation easier. Thus for the arithmetic mean of the numbers in col. (2), we have

$$\bar{x} = (+24.1 - 26.6)/20 = -0.125;$$

and for the mean of the numbers in col. (4), we have

$$\bar{y} = (+41 - 25)/20 = +0.8.$$

The straightforward procedure would now be to get the twenty corresponding values of $\xi$ and $\eta$, the deviations of the twenty x's in col. (2) and of the twenty y's in col. (4) from $\bar{x}$ and $\bar{y}$ respectively, and, having found $\sigma_x$ and $\sigma_y$, we could immediately deduce r from the formula

$$r = (\xi_1\eta_1 + \dots + \xi_{20}\eta_{20})/20\sigma_x\sigma_y.$$

But it is simpler to measure the deviations from (0, 0) as origin rather than from the mean (-0.125, +0.8), because $x^2$, $y^2$, and $xy$ involve fewer significant figures than would $\xi^2$, $\eta^2$, and $\xi\eta$, and, of course, it will be necessary to correct for this at the end in the usual way.

The mean square deviation of x referred to zero as origin = 197.17/20, by col. (3). Therefore,

$$\sigma_x^2 = 197.17/20 - (0.125)^2 = 9.843, \qquad \sigma_x = 3.14.$$

Again, the mean square deviation of y referred to zero as origin = 306/20, by col. (5). Therefore,

$$\sigma_y^2 = 306/20 - (0.8)^2 = 14.66, \qquad \sigma_y = 3.83.$$
Also the corrected

$$p = (\Sigma xy)/n - \bar{x}\bar{y} = 100.3/20 - (-0.125)(+0.8), \text{ by col. (6)}$$
$$= 5.015 + 0.100 = 5.115.$$

Hence

$$r = p/\sigma_x\sigma_y = 5.115/(3.14)(3.83) = 0.43.$$

It is necessary to be careful with the signs in forming the numbers in col. (6), but otherwise the actual calculation should present no difficulty.

The regression equation giving the best marriage rate difference, Y, for a given wholesale price difference, X, from their respective nine-yearly averages is

$$(Y - 0.8) = r\frac{\sigma_y}{\sigma_x}(X + 0.125) = (0.43)\frac{(3.83)}{(3.14)}(X + 0.125),$$

i.e. $Y = 0.52X + 0.86$.

The regression equation giving the best wholesale price difference, X, for a given marriage rate difference, Y, from their respective nine-yearly averages is

$$(X + 0.125) = r\frac{\sigma_x}{\sigma_y}(Y - 0.8) = 0.35(Y - 0.8),$$

i.e. $X = 0.35Y - 0.40$.

We noted that fig. (10), p. 80, suggested a closer correlation between the two factors we have been considering during the earlier years of the period 1875-1908 than during the later years. It might be worth while as an exercise to see if this is borne out by calculating r for the years 1875-1889, and comparing it with the value found for the years 1889-1908.

Example (2). To find the correlation between Overcrowding and Infant Mortality in London Districts. [Data taken from London Statistics, vol. 23, published by the London County Council.]

The figures are apparently based upon the Census Report of 1911. The numbers in col. (2), Table (27), show what percentage of the total population occupying private houses in each district were living in overcrowded conditions, any ordinary tenement which has more than two occupants to a room, including bedrooms and sitting-rooms, being defined as overcrowded. The numbers in col. (5) show the infantile mortality in each district, that is, the number of infants who died under one year out of every 1000 born, including both sexes. For the sake of comparison these numbers have been plotted together on the same graph sheet.
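The device used in Example (1), and again in Example (2), of measuring deviations from a convenient origin and correcting afterwards, needs only six totals from the table. The sketch below is a hedged illustration with a function name of our own: it forms r from those totals and applies it to the column totals of Table (26) and to those of Table (27), the latter being measured from (17.0, 125) as origin.

```python
from math import sqrt


def r_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Coefficient of correlation from totals measured about any fixed origin."""
    mean_x = sum_x / n
    mean_y = sum_y / n
    var_x = sum_xx / n - mean_x ** 2      # corrected mean square deviation of x
    var_y = sum_yy / n - mean_y ** 2      # corrected mean square deviation of y
    p = sum_xy / n - mean_x * mean_y      # corrected mean product deviation
    return p / sqrt(var_x * var_y)


# Totals of Table (26): prices and marriage rate, origin (0, 0).
r_example_1 = r_from_sums(20, 24.1 - 26.6, 41 - 25, 197.17, 306, 141.9 - 41.6)

# Totals of Table (27): overcrowding and infant mortality, origin (17.0, 125).
r_example_2 = r_from_sums(29, 119.3 - 94.4, 256 - 226, 2519.81, 12748, 4322.9 - 416.1)

print(round(r_example_1, 2), round(r_example_2, 2))
```

Because the correction terms remove the effect of the origin, the same function serves whatever working origin is chosen.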
The districts, arranged in alphabetical order, were numbered from 1 to 29 so as to form a horizontal scale corresponding to the scale of years in discussing prices and marriages. The scale in this case is, of course, purely artificial, and the only reason for joining up neighbouring points is that we are better able by so doing to see whether or not high values of the one variable go with high values of the other variable, and low with low.

In calculating the mean and standard deviation for overcrowding we have measured deviations from 17.0 as origin, and in making the same calculations for infant mortality we have measured deviations from 125 as origin. It is convenient, therefore, to use the point (17.0, 125) as origin in working out also the product deviation sum, col. (8) of Table (27), instead of using the mean (17.86, 126).

Table (27). Correlation between Overcrowding and Infant Mortality in London Districts (1911).

Col. (1): District.
Col. (2): Percentage of Population Overcrowded (x).
Col. (3): Deviation of No. in Col. (2) from 17.0.
Col. (4): Square of No. in Col. (3).
Col. (5): Infant Mortality (y).
Col. (6): Deviation of No. in Col. (5) from 125.
Col. (7): Square of No. in Col. (6).
Col. (8): Product of Nos. in Col. (3) and Col. (6).

(1)                       (2)      (3)       (4)     (5)    (6)     (7)       (8)
(1) Battersea            13.3    -  3.7     13.69    124    -  1       1    +   3.7
(2) Bermondsey           23.4    +  6.4     40.96    156    + 31     961    + 198.4
(3) Bethnal Green        33.2    + 16.2    262.44    151    + 26     676    + 421.2
(4) Camberwell           13.5    -  3.5     12.25    109    - 16     256    +  56.0
(5) Chelsea              14.9    -  2.1      4.41    109    - 16     256    +  33.6
(6) City of London       12.3    -  4.7     22.09    124    -  1       1    +   4.7
(7) Deptford             12.2    -  4.8     23.04    142    + 17     289    -  81.6
(8) Finsbury             39.8    + 22.8    519.84    156    + 31     961    + 706.8
(9) Fulham               14.6    -  2.4      5.76    125      ..      ..         ..
(10) Greenwich           12.1    -  4.9     24.01    128    +  3       9    -  14.7
(11) Hackney             12.4    -  4.6     21.16    119    -  6      36    +  27.6
(12) Hammersmith         14.2    -  2.8      7.84    146    + 21     441    -  58.8
(13) Hampstead            7.1    -  9.9     98.01     78    - 47    2209    + 465.3
(14) Holborn             25.6    +  8.6     73.96    115    - 10     100    -  86.0
(15) Islington           20.0    +  3.0      9.00    127    +  2       4    +   6.0
(16) Kensington          17.1    +  0.1      0.01    133    +  8      64    +   0.8
(17) Lambeth             13.6    -  3.4     11.56    123    -  2       4    +   6.8
(18) Lewisham             3.9    - 13.1    171.61    104    - 21     441    + 275.1
(19) Paddington          16.2    -  0.8      0.64    127    +  2       4    -   1.6
(20) Poplar              20.6    +  3.6     12.96    157    + 32    1024    + 115.2
(21) St. Marylebone      20.7    +  3.7     13.69    108    - 17     289    -  62.9
(22) St. Pancras         25.5    +  8.5     72.25    112    - 13     169    - 110.5
(23) Shoreditch          36.6    + 19.6    384.16    170    + 45    2025    + 882.0
(24) Southwark           25.8    +  8.8     77.44    144    + 19     361    + 167.2
(25) Stepney             35.0    + 18.0    324.00    144    + 19     361    + 342.0
(26) Stoke Newington      8.8    -  8.2     67.24    102    - 23     529    + 188.6
(27) Wandsworth           6.3    - 10.7    114.49    122    -  3       9    +  32.1
(28) Westminster         12.9    -  4.1     16.81    103    - 22     484    +  90.2
(29) Woolwich             6.3    - 10.7    114.49     97    - 28     784    + 299.6
Totals                           + 119.3   2519.81          + 256   12748   + 4322.9
                                 -  94.4                    - 226           -  416.1

For overcrowding, mean = 17 + 24.9/29 = 17.86;

$$\sigma_x = \sqrt{(2519.81/29) - (0.86)^2} = \sqrt{86.15} = 9.3.$$

For infant mortality, mean = 125 + 30/29 = 126.03;

$$\sigma_y = \sqrt{(12748/29) - (1.03)^2} = \sqrt{438.5} = 20.9.$$

Also p, referred to (17.0, 125) = (4322.9 - 416.1)/29 = 3907/29, and, referred to the mean (17.86, 126.03), this becomes

$$= 3907/29 - (0.86)(1.03) = 133.8.$$

Hence r = 133.8/(9.3)(20.9) = 0.69, so that the correlation between overcrowding and infant mortality is fairly marked.

[Fig. (24): overcrowding and infant mortality plotted for each district; horizontal axis: numbers representing various London Districts.]

The regression equation giving the average infant mortality, Y, for districts in which the extent of overcrowding, X, is known is

$$Y - 126.03 = r\frac{\sigma_y}{\sigma_x}(X - 17.86) = \frac{(0.69)(20.9)}{9.3}(X - 17.86),$$

i.e. $Y = 1.55X + 98.4$.

Similarly, the regression equation giving the average percentage of overcrowding, X, for districts with a known amount of infant mortality, Y, is

$$X - 17.86 = r\frac{\sigma_x}{\sigma_y}(Y - 126.03) = 0.31(Y - 126.03),$$

i.e. $X = 0.31Y - 21.0$.

Example (3).
The reader might apply the same method to the determination of the correlation between Ratio of Indoor Paupers and Ratio of Outdoor Paupers, each measured per 1000 of the estimated Population in England and Wales, excluding casuals and insane, during the years 1900-1914. The following are the statistics required for the purpose:

Table (28). Correlation between Ratio of Indoor and Ratio of Outdoor Paupers, each measured per 1000 of the Population.

Year     Indoor Paupers:     Outdoor Paupers:
         Rate per 1000       Rate per 1000
1900         6.9                 15.8
1901         5.8                 15.3
1902         6.0                 15.3
1903         6.2                 15.4
1904         6.3                 15.4
1905         6.6                 16.1
1906         6.8                 16.0
1907         6.8                 15.6
1908         6.8                 15.4
1909         7.1                 15.6
1910         7.2                 15.1
1911         7.2                 14.1
1912         6.9                 11.2
1913         6.7                 11.1
1914         6.4                 10.4

The coefficient of correlation in this case comes out negative and = -0.15, but it is very small and probably not significant. If it were, it would imply that as indoor pauperism diminishes outdoor pauperism increases, and vice versa.

Example (4). To find the correlation between the Number of Cattle and the Number of Acres of Permanent Grass-land in the Coal-Producing Counties of England (1915).

A Government Report was consulted giving the acreage under crops and grass and the number of live stock in each petty sessional division in the country, as returned on 4th June 1915, and the counties included were those which appear in the coal-mining reports published monthly in the Labour Gazette.

In each county the petty sessional divisions with the greatest and the least numbers of cattle and of acres of grass-land were noted, the numbers being written down to the nearest 1000, and, after a rough examination of the range of these variables from county to county, suitable class intervals were chosen and a table of double entry was drawn up, Table (29), with an empty square ready for each possible pair of variables.

Table (29).
Correlation between the Number of Cattle AND the Number of Acres of Permanent Grass-land in the Coal-Producing Counties of England (1915). Total Head of Cattle (expressed to nearest thousand) x^ ^2 ^3 X^ 0-5 5-10 10-15 15-20 20-25 25-30 30-35 35-40 Totals Mean x : lo y^ :: '5 15 2-50 0-5 : : 150 ^2 ; : ,9 ! ! :27 4 3 30 3 00 T3 5-10 : : '."(> : iz ^3 ! : : 6 I i -30 I: 3 :: 18 48 4-37 a CO o 10-15 I i :i8o :: 54 4 3 : : : 2 i j :30 33 7-04 So CO 15-20 ; 12 \\\6o '. '• * 0 : : :25 : 5 30 8-33 20-25 1 : i^S ; 0 0 : ° • 0 0 CO 1 : . 14 : 9 2 26 9-81 25-30 • 0 : : 0 S 0 ; 0 -I : 6 i i 0 : : 22 I 3 31 12-02 30-35 : -6 : : '.° •: 3 -2 : 0 2 4 1 : 12 : 6 . 4 23 15-33 a 35-40 -2 : : 0 i '^ : 16 CO 0 3 9 i3 3 . 4 1 8 16-87 c5 HO 40-45 : 0 ; 12 . 9 0 4 8 16 to 3 3 3 1 10 1900 45-50 ; 0 : 12 : 24 . 16 5 10 15 CI. . 4 . 4 1 9 20-83 •^ 50-55 : 20 : 40 . 'S 6 12 24 30 ^ 1 2 1 1 5 26-50 55-60 . 6 : 24 . 24 . 30 21 1 1 27-50 o 5 "a »2 60-65 . 21 24 1 1 27-50 65-70 . 24 27 1 1 27-50 70-75 . 27 40 3 3 32-5 75-80 1 120 n 22 ^? 1 1 2 20-0 80-85 . " . 22 Totals 76 97 54 24 14 5 5 1 276 Mean y 9 14 2013 33-24 43-33 50-00 59-50 67-50 57-5 122 STATISTICS Each petty sessional division was then considered in turn and a dot was inserted in the particular square applicable to it : e.g. a petty sessional division with 42,000 acres of grass-land and feeding 19,000 cattle would be represented by a dot in the square defined by row (40-45) and col. (15-20) in Table (29) ; x was used to repre- sent the number of cattle and y the number of acres of grass-land in any division, each expressed to the nearest 1000 units. All the dots were ultimately added in each square giving the frequency for each corresponding pair of variables, and these frequencies were recorded in the centres of the squares to which they applied : e.g. the frequency of petty sessional divisions stocking 10 to 15 thousand cattle and with 30 to 35 thousand acres under permanent grass was 22. The total frequency for each row, i.e. 
each array of selected y type, was also noted, in the column at the end of the rows: e.g. altogether 31 petty sessional divisions were observed of the type having 30 to 35 thousand acres of land under permanent grass. Likewise the total frequency for each column, i.e. each array of selected x type, was noted in the row at the foot of the columns: e.g. altogether 54 divisions were observed of the type stocking 10 to 15 thousand head of cattle.

It was possible now to treat each column separately and to calculate the mean y's associated with different types of x, namely $x_1, x_2, x_3, \dots$, and the frequencies so obtained were inserted in the bottom row of Table (29): e.g. when x lies between 20 and 25 thousand, the mean value of y is 50 thousand. The resulting points, $(x_1, \bar{y}_1)$, $(x_2, \bar{y}_2)$, $(x_3, \bar{y}_3)$, ... in the notation of Chapter X, are plotted together in fig. (25), and they are seen to lie approximately in a straight line.

The successive rows were treated in precisely the same way and the mean x's calculated corresponding to y's of different types, namely $y_1, y_2, y_3, \dots$, the frequencies obtained being recorded in the extreme right-hand column of Table (29): e.g. when y lies between 45 and 50 thousand, the mean value of x is 19 thousand. The resulting points $(\bar{x}_1, y_1)$, $(\bar{x}_2, y_2)$, $(\bar{x}_3, y_3)$, ..., are also plotted in fig. (25), and, excepting for values which depend upon only one or two records, they too lie roughly in a straight line which is not far from coinciding with the previous one, so that we shall expect on calculation to get a high value for the coefficient of correlation.

In order to calculate r we need first to find the mean and standard deviation for each variable. For this let us take as origin the point (12.5, 27.5). The essential details are shown immediately below the relative Tables (30) and (31).

Table (30).
Distribution of Petty Sessional Divisions according to the Head of Cattle (expressed to nearest 1000) stocked.

(1)                 (2)            (3)               (4)                (5)
No. of Cattle       Deviation      No. of Petty      Product of         Product of
stocked (in         from 12.5.     Sessional         Nos. in            Nos. in
thousands). (x)                    Divisions.        Cols. (2) & (3).   Cols. (2) & (4).
 0-5                  - 2             76               - 152              304
 5-10                 - 1             97               -  97               97
10-15                   0             54                  ..               ..
15-20                 + 1             24               +  24               24
20-25                 + 2             14               +  28               56
25-30                 + 3              5               +  15               45
30-35                 + 4              5               +  20               80
35-40                 + 5              1               +   5               25
Totals                               276               - 157              631

Mean number of cattle $= 12.5 - \frac{157}{276} \times 5 = 9.66$, since $\bar{x} = -\frac{157}{276}$ class units referred to 12.5 as origin; and

$$\sigma_x = 5\sqrt{\tfrac{631}{276} - \left(\tfrac{157}{276}\right)^2} = 5\sqrt{1.963} = 7.00.$$

[The numbers in col. (4) may be spoken of as the first moments of the totals of x arrays and the numbers in col. (5) as the second moments.]

In order to calculate easily the product deviation with reference to (12.5, 27.5) as origin, the value proper to each square was inserted just above the frequency and the product of the deviation by the frequency was inserted just below the frequency in different type of print to prevent confusion: e.g. the row (50-55) is +5 class intervals distant from the row (25-30) containing the origin, and the column (20-25) is +2 class intervals distant from the column (10-15) containing the origin; hence, for the particular square defined by this row and this column, the product deviation = 5 x 2 = 10; also the frequency recorded in this square = 4, so that it supplies a term 10 x 4 to the product deviation; the numbers 10, 4, and 40 are therefore the numbers which appear in the square. It is necessary to be careful with the signs; if the product deviation is to be positive, the separate deviations must be of like sign, both positive or both negative: hence they must either be both above or both below the numbers 12.5 and 27.5 respectively from which they are measured. In this instance there are only two negative terms among the product deviations in the whole table.

Table (31).
Distribution of Petty Sessional Divisions according to the Number of Acres of Land (expressed to nearest 1000) under Permanent Grass.

(1) (2) (3) (4) (5)

Mean number of acres $= 27.5 - \frac{143}{276} \times 5 = 24.91$, since $\bar{y} = -\frac{143}{276}$ class units; and

$$\sigma_y = 5\sqrt{\tfrac{2945}{276} - \left(\tfrac{143}{276}\right)^2} = 5\sqrt{10.402} = 16.12.$$

[The numbers in col. (4) are the first moments of the totals of y arrays, and the numbers in col. (5) are the second moments.]

It is now a simple matter to sum the product deviation terms, taking each column (or each row) in turn: e.g. the first column gives

150 + 216 + 180 + 12 = 558;

the second column gives

12 + 54 + 60 + 25 - 6 - 2 = 143,

and so on; and, summing these results together, we get

558 + 143 + 76 + 126 + 96 + 160 + 30 = 1189.

But this is the sum of all the product deviations referred to (12.5, 27.5) as origin. Transferring now to the mean, we have

$$p = \frac{1189}{276} - \left(-\frac{157}{276}\right)\left(-\frac{143}{276}\right) = 4.013,$$

expressed in class units. Hence, $r = p/\sigma_x\sigma_y$, where $\sigma_x$ and $\sigma_y$ are also to be expressed in class units,

$$= 4.013/\sqrt{1.963}\,\sqrt{10.402} = 0.89,$$

a result not far from unity, so that the correlation is high.

The regression of 'acreage of grassland' (Y) on 'head of cattle' (X) is given by

$$(Y - 24.91) = r\frac{\sigma_y}{\sigma_x}(X - 9.66) = (0.89)\frac{(16.12)}{(7.00)}(X - 9.66),$$

i.e. $Y = 2.05X + 5.11$.

The points representing the mean y's for x's of different types should lie close to this line which is shown in fig. (25). This equation enables us to predict the acreage under permanent grass to be found on the average in petty sessional divisions with a given total head of cattle in each. The words 'on the average,' to be tacitly understood even if not stated in all such cases, are emphasised because the prediction relates to the whole array of divisions of a particular type, and as it only professes to give the mean or most likely result it is not to be pronounced worthless if it fails in an individual trial with a selected division.

Again, the regression of X on Y is given by

$$(X - 9.66) = r\frac{\sigma_x}{\sigma_y}(Y - 24.91),$$

i.e.
$X = 0.39Y + 0.05$, which tells us the total head of cattle (X) to be found on the average in petty sessional divisions when the acreage under permanent grass (Y) is known. This line is also drawn in fig. (25).

[Fig. (25). The lines of regression, $Y = 2.05X + 5.11$ and $X = 0.39Y + 0.05$, shown with the means of y's for x's of different types and the means of x's for y's of different types. Horizontal axis: X, Total Head of Cattle (expressed to nearest thousand).]

Example (5). The data for this example are taken from an exceedingly interesting Government Report on the Cost of Living of the Working Classes (Report of an Inquiry by the Board of Trade into Working Class Rents and Retail Prices together with the Rates of Wages in certain Occupations in Industrial Towns of the United Kingdom in 1912 in continuation of a similar Inquiry in 1905. Cd. 6955). Some further particulars concerning this Report will be found on p. 281.
The towns included in the inquiry numbered 93, but in five instances it was found desirable to consider closely adjacent municipalities as single towns, thus reducing the number of town-units to 88, namely 72 in England, 10 in Scotland, and 6 in Ireland. In the example which follows the three zones of London, middle, inner, and outer, have been treated as separate towns, so making the net number of town-units 90. This number is too small to allow any real value to be attached to our results, but the fewness of the observations makes them easier to deal with as an illustration of method.

We begin as before by choosing convenient class intervals for the two factors we propose to consider, namely, Increment of Unskilled Wages and Increment of Rents (by increment in each case is meant the percentage increase (+) or decrease (-) between 1905 and 1912), and then form a correlation table. In the last example separate tables were drawn up to find means and S.D.'s, but that was only done in order to keep the argument clear at its first presentment: generally we may dispense with these additional tables and show all the working in one (see Table (32)).

The increment of wages runs from (-2.5) per cent. to (+11.5) per cent., so that, if we take (-0.5) as origin and a difference of 2 per cent. as unit, the classes run from (-1) to (+6), these numbers being shown in different type in the table, but in the same compartments as the others.

In the fourth row from the bottom are shown the total frequencies for x arrays from class (-1) to class (+6), and in the row just below it these several frequencies are shown multiplied by their corresponding deviations measured from (-0.5) as origin in terms of the class unit; the resulting numbers give the first moments of the totals of x arrays.
These numbers, multiplied again by their corresponding deviations, give the second moments of the totals of x arrays, and appear in the last row but one of the table.

We deal in exactly the same way with increment of rents: a percentage increment of (-1) is taken as origin from which deviations are measured, a difference of 3 per cent. is taken as unit, and the different classes then have deviations running from (-3) to (+6). The totals of y arrays, the first moments, and the second moments of these totals appear in the last three columns on the right-hand side of Table (32).

To calculate the deviation products, numbers were inserted in each square on the same principle as in the last example, and the sums of these products for each x array, that is for each column, are given in the bottom row of the table (1, 0, 14, 6, etc.), making in all a total of 126.

Table (32). Correlation between Increment of Unskilled Wages and Increment of Rents in certain Industrial Towns of the United Kingdom.

[Correlation table: columns give X = Percentage Increment of Wages, rows give Y = Percentage Increment of Rents; the entries of the individual squares are not reproduced. The marginal rows and columns are as follows.]

X class (deviation)             -1     0    +1    +2    +3    +4    +5    +6   Total
X class (per cent.)           -2.5  -0.5   1.5   3.5   5.5   7.5   9.5  11.5
Totals of x arrays               2    45     8    12     7     7     8     1     90
1st moments of x arrays         -2     -     8    24    21    28    40     6    125
2nd moments of x arrays          2     -     8    48    63   112   200    36    469
Product sums of x arrays         1     0    14     6     9    52    50    -6    126

Y class        Y class        Totals of     1st moments     2nd moments
(deviation)    (per cent.)    y arrays      of y arrays     of y arrays
   -3            -10              1             -3               9
   -2             -7              4             -8              16
   -1             -4             10            -10              10
    0             -1             30              0               0
   +1              2             18             18              18
   +2              5             11             22              44
   +3              8             11             33              99
   +4             11              3             12              48
   +5             14              1              5              25
   +6             17              1              6              36
Totals                           90             75             305

The necessary calculations are as follows:

1. Mean $x = -0.5 + 2(125)/90 = 2.28$, $\sigma_x = 2\sqrt{\frac{469}{90} - \left(\frac{125}{90}\right)^2} = 2\sqrt{26585}/90$.

2. Mean $y = -1 + 3(75)/90 = 1.50$, $\sigma_y = 3\sqrt{\frac{305}{90} - \left(\frac{75}{90}\right)^2} = 3\sqrt{21825}/90$.

3.
$p = \frac{126}{90} - \left(\frac{125}{90}\right)\left(\frac{75}{90}\right) = \frac{1965}{(90)^2}$, expressed in class units.

Hence $r = p/\sigma_x\sigma_y = \frac{1965}{(90)^2} \times \frac{90 \times 90}{\sqrt{26585}\,\sqrt{21825}} = 0.08$.

In substituting for $\sigma_x$ and $\sigma_y$ to find r we have omitted the factors 2 and 3 respectively, because the S.D.'s have to be expressed in the same units as p. Alternatively, if we worked with a difference of 1 per cent. as unit, instead of taking a difference of 2 per cent. as unit for x deviations, and a difference of 3 per cent. as unit for y deviations, each individual product of x and y deviations would . . .

. . . $(p + q) = 1/4 + 3/4 = 1$.

Example (2). What is the chance of drawing either a picture card or an ace from the pack at a single trial?

Altogether there are 12 picture cards, and the chance of drawing any one of them is thus 12 out of 52 $= 12/52 = 3/13$; and the chance of drawing any one of the 4 aces is 4 out of 52 $= 4/52 = 1/13$. Hence the total probability required $= 3/13 + 1/13 = 4/13$.

Generally, if the probability of one type of event is $p_1$, and the probability of a second type of event is $p_2$, and if either type is reckoned a success, then the total probability of success is $(p_1 + p_2)$. This evidently holds good however many different types there may be, and even if there is only one event of each type.

Consider now the simultaneous happening of two events, one of which can happen in $n$ different ways, $a$ among which are to be regarded as successful, and the second can happen in $n'$ different ways, $a'$ among which are to be regarded as successful. Further, the two events are to be absolutely independent of one another in the sense that neither is to influence the success or failure of the other. What is the probability of a double success occurring?

The total number of different combinations of the two events possible is $nn'$, for any one of the $n$ possible happenings for the first event can be combined with any one of the $n'$ possible happenings for the second event.
Also the total number of different combinations of two successes possible is $aa'$, for any one of the $a$ possible successes for the first event can be combined with any one of the $a'$ possible successes for the second event. Hence, according to our definition of probability, the probability of a double success is $aa'$ out of $nn' = aa'/nn' = (a/n)(a'/n')$.

Thus to get the probability of a double success for a combination of two independent events we must multiply together the separate probabilities for the success of each event taken by itself.

Similarly, in the above case, the probability of a double failure $= (n-a)(n'-a')/nn'$; and the probability of one success and one failure

$= \frac{a}{n} \cdot \frac{n'-a'}{n'} + \frac{n-a}{n} \cdot \frac{a'}{n'}$,

for the first event can be a success and the second a failure or the first a failure and the second a success.

Here, again, if we take all the different possibilities into account, and add the probabilities corresponding to each case, we arrive at certainty, the measure of which is unity, thus:

probability of 2 successes $= aa'/nn'$,
probability of 1 success and 1 failure $= a(n'-a')/nn' + a'(n-a)/nn'$,
probability of 2 failures $= (n-a)(n'-a')/nn'$.

Therefore total probability, all cases,

$= \frac{aa'}{nn'} + \frac{a(n'-a')}{nn'} + \frac{a'(n-a)}{nn'} + \frac{(n-a)(n'-a')}{nn'}$
$= (aa' + an' - aa' + a'n - a'a + nn' - na' - an' + aa')/nn'$
$= nn'/nn' = 1$.

Example. Take two packs of cards. What is the probability of drawing an ace from the first pack and a king, queen, or knave from the second pack?

Here $a = 4$, $n = 52$, $a' = 12$, $n' = 52$; hence the required probability $= aa'/nn' = 4/52 \times 12/52 = 3/169 = 1/56\tfrac{1}{3}$. Thus we might expect to succeed on the average about once in 56 trials.

We proceed to discuss the case of a coin spun a number of times in succession, and we shall find the probabilities of the appearance of so many heads (H) and so many tails (T) in so many spins on the hypothesis that the coin is perfectly balanced and equally likely to fall on either side.
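
The rule for two independent events can be checked with exact fractions. The following is only a sketch: the function name `double_event` and its argument names are illustrative choices, not part of the text, and the figures are those of the two-pack example above.

```python
from fractions import Fraction

def double_event(a, n, a2, n2):
    """Probabilities for two independent events.

    The first event succeeds in a ways out of n, the second in a2 ways
    out of n2 (these argument names are illustrative assumptions).
    """
    p1, p2 = Fraction(a, n), Fraction(a2, n2)
    q1, q2 = 1 - p1, 1 - p2
    return {
        "two successes": p1 * p2,
        "one success, one failure": p1 * q2 + q1 * p2,
        "two failures": q1 * q2,
    }

# An ace from the first pack; a king, queen, or knave from the second.
cards = double_event(4, 52, 12, 52)
print(cards["two successes"])   # 3/169
print(sum(cards.values()))      # 1, the measure of certainty
```

Working in fractions rather than decimals keeps the check exact, so the three cases add up to unity without rounding error.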
In 1 spin there are 2 possible events, namely H or T, which we shall write simply as (H, T).

In 2 spins there are 4 possible events, because we can combine the H or T of the first with an H or T at the second spin, and we may express the result thus

(H, T)(H, T) = (HH, HT, TH, TT);

the interpretation of which is that we may get either head followed by head, or head followed by tail, or tail followed by head, or tail followed by tail.

In 3 spins there are 8 possible events, because we can combine the 4 events previously possible with an H or T at the third spin, thus getting

(H, T)(H, T)(H, T) = (H, T)(HH, HT, TH, TT) = (HHH, HHT, HTH, HTT, THH, THT, TTH, TTT);

the interpretation of which is that we may get either 3 heads in succession, or 2 heads followed by 1 tail, or head followed by tail followed by head, and so on.

In 4 spins there are 16 possible events, because we can combine the 8 events previously possible with an H or T at the fourth spin, thus

(H, T)(HHH, HHT, HTH, HTT, THH, THT, TTH, TTT) = (HHHH, HHHT, HHTH, HHTT, HTHH, HTHT, HTTH, HTTT, THHH, THHT, THTH, THTT, TTHH, TTHT, TTTH, TTTT).

But the method here adopted to get the possible events at each stage is precisely the same as that which gives the successive terms in the ordinary algebraical expansions of

(H+T), (H+T)(H+T), (H+T)(H+T)(H+T), etc.

Also each new spin has the effect of doubling the number of possible events obtained at the previous spin, and we conclude that in $n$ spins there are ($2 \times 2 \times 2 \times \ldots$ to $n$ factors), or $2^n$, possible events, and these events are given by the successive terms in the expansion of

[(H+T)(H+T)(H+T) . . . to $n$ factors].

Let us now consider the probabilities of the different events obtainable.
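
The doubling of possible events at each spin can be seen by listing them mechanically. This is a minimal sketch; the helper name `spin_outcomes` is an assumption made for illustration.

```python
from itertools import product

def spin_outcomes(n):
    """List every ordered result of n spins, each spin giving H or T."""
    return ["".join(seq) for seq in product("HT", repeat=n)]

print(spin_outcomes(3))
# ['HHH', 'HHT', 'HTH', 'HTT', 'THH', 'THT', 'TTH', 'TTT']

# Each extra spin doubles the count, giving 2**n events in n spins.
for n in range(1, 6):
    print(n, len(spin_outcomes(n)))
```

The order in which the events are produced agrees with the order in which they are written out above for 3 spins.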
The important point to notice is that at any stage each possible event has exactly the same probability, for there is no reason why any particular spin should give H rather than T, or T rather than H: for example, in 3 spins there are 8 possible events, each by itself equally probable, and we therefore divide the unity of certainty into 8 equal parts and assign one part to each event, thus

probability of 3 heads:            HHH = 1/8
probability of 2 heads and 1 tail: HHT = 1/8, HTH = 1/8, THH = 1/8
probability of 1 head and 2 tails: HTT = 1/8, THT = 1/8, TTH = 1/8
probability of 3 tails:            TTT = 1/8.

It is clear from this arrangement that, if the order of the appearance of H and T is indifferent, some events are of the same type and some types are likely to appear oftener than others, e.g. the probability of getting '2 heads and 1 tail' (or '1 head and 2 tails') is three times as great as the probability of getting '3 heads' or '3 tails.' Hence for conciseness it is convenient to adopt the ordinary index notation and write

$HHH = H^3$, $HHT = H^2T$, $HTH = H^2T$, etc.,

so that the possible events in 3 spins are

$H^3$, $3H^2T$, $3HT^2$, $T^3$;

in 4 spins they are

$H^4$, $4H^3T$, $6H^2T^2$, $4HT^3$, $T^4$;

and so on.

The probability of any particular type is now readily written down: e.g. in 4 spins, the probability of getting 2 heads and 2 tails

= (number of successful events possible)/(total number of events possible) $= 6/2^4 = 6/16 = 3/8$.

But the binomial expansion always sums together terms of the same type for us in just the manner wanted, and we have the possible events in $n$ spins given by the successive terms in the expansion of

(H+T)(H+T)(H+T) . . . to $n$ factors, i.e. $(H+T)^n$,

i.e. $H^n + {}^nC_1 H^{n-1}T + {}^nC_2 H^{n-2}T^2 + \ldots + T^n$,

and therefore again the probability of any particular combination is readily written down: e.g. probability of '$(n-2)$ heads, 2 tails'

= (number of successful events possible)/(total number of events possible) $= {}^nC_2/2^n$.
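
The grouping of equally probable events into types can be verified by direct counting. The sketch below (the function name `head_count_probabilities` is an illustrative assumption) counts the heads in every ordered event and compares the result with the binomial coefficient ${}^nC_r$.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb

def head_count_probabilities(n):
    """Probability of each number of heads in n spins of a fair coin,
    found by counting the 2**n equally probable ordered events."""
    counts = Counter(seq.count("H") for seq in product("HT", repeat=n))
    return {heads: Fraction(c, 2 ** n) for heads, c in counts.items()}

probs4 = head_count_probabilities(4)
print(probs4[2])   # 3/8, i.e. 6 events out of 16

# The counts agree with the coefficients in the expansion of (H+T)**n.
for heads, prob in head_count_probabilities(6).items():
    assert prob == Fraction(comb(6, heads), 2 ** 6)
```

Counting is practicable only for small n, since the number of events doubles at every spin; the binomial coefficient gives the same figure directly.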
Another way of stating the result obtained is to say that we might expect to get

$n$ heads appearing on the average about once in every $2^n$ trials,
$(n-1)$ heads, 1 tail appearing on the average about ${}^nC_1$ times in every $2^n$ trials,
$(n-2)$ heads, 2 tails appearing on the average about ${}^nC_2$ times in every $2^n$ trials,

and so on.

If, in accord with our previous notation, we call the appearance of, say, H at any spin a 'success,' and label its probability, 1/2, by the letter $p$, and if consequently the appearance of T at any spin is a 'failure,' its probability, 1/2, to be labelled by the letter $q$, we have the probabilities of the different combinations of events in

$(H+T)^n$, or $H^n + {}^nC_1 H^{n-1}T + {}^nC_2 H^{n-2}T^2 + \ldots + T^n$,

given by the corresponding terms in

$(p+q)^n$, or $p^n + {}^nC_1 p^{n-1}q + {}^nC_2 p^{n-2}q^2 + \ldots + q^n$,

where $p = q = 1/2$.

After each spin of the coin in the case considered the distribution of probabilities was symmetrical, e.g. after the fourth spin the probabilities were

1/16, 4/16, 6/16, 4/16, 1/16.

We pass on now to a case where the distribution is not symmetrical, owing to the fact that $p$ and $q$ are no longer equal for any isolated event.

Consider the throw of an ordinary die in which each of the six faces is assumed to have an equal chance of appearing uppermost. The probability of throwing, say, a 3 is 1/6, since we are certain to throw either 1, 2, 3, 4, 5, or 6; and the probability of failing to throw a 3 is 5/6, since we are certain either to throw a 3 or not to throw a 3.

If we represent the probability of success (say, in this case, throwing a 3) by $p$ (i.e. 1/6), and failure (i.e. in this case, failing to throw a 3) by $q$ (i.e. 5/6), we have

$p + q = 1/6 + 5/6 = 1$.

Bearing in mind then that the probability for a combination of two independent events is determined by multiplying together the separate probabilities for each, we have the following table showing what might be expected when 1, 2, or 3 dice are thrown up together, where $s$ stands for success and $f$ for failure:

No. of Dice    Different Possibilities.                   Different Probabilities.
thrown.
1              s, f.                                      p, q.
2              ss, sf, fs, ff.                            pp, pq, qp, qq.
3              sss, ssf, sfs, sff, fss, fsf, ffs, fff.    ppp, ppq, pqp, pqq, qpp, qpq, qqp, qqq.

The table is easily extended on the same principle, and at each step, it will be noticed, a fresh pair of possibilities, $s$ or $f$, is introduced, with corresponding $p$ or $q$, to be combined with what has gone before.

If the order of appearance of $s$ and $f$ is a matter of indifference, e.g. if it does not matter whether the first die shows $s$ and the second $f$, or vice versa, so that results of the type $sff$ and $fsf$ may be regarded as equivalent, we may use the index notation, as in the coin case, to render the table more concise, thus:

No. of Dice    Different Possibilities.          Corresponding Probabilities.
thrown.
1              s, f.                             p, q.
2              s^2, 2sf, f^2.                    p^2, 2pq, q^2.
3              s^3, 3s^2f, 3sf^2, f^3.           p^3, 3p^2q, 3pq^2, q^3.

When, therefore, $n$ dice are thrown we again recognize the different possibilities as given by the successive terms in the expansion of $(s+f)^n$, namely

$s^n + {}^nC_1 s^{n-1}f + {}^nC_2 s^{n-2}f^2 + \ldots + f^n$,

and the corresponding probabilities by the successive terms in the expansion of $(p+q)^n$, namely

$p^n + {}^nC_1 p^{n-1}q + {}^nC_2 p^{n-2}q^2 + \ldots + q^n$.

Hence the probability of throwing

$n$ threes $= p^n = 1/6^n$;
$(n-1)$ threes $= {}^nC_1 p^{n-1}q = n \cdot \frac{1}{6^{n-1}} \cdot \frac{5}{6} = 5n/6^n$;
$(n-2)$ threes $= {}^nC_2 p^{n-2}q^2 = \frac{n(n-1)}{1 \cdot 2} \cdot \frac{1}{6^{n-2}} \cdot \left(\frac{5}{6}\right)^2 = 25n(n-1)/(2 \cdot 6^n)$;

and so on.

The result we have just obtained is of perfectly general application. Whether we spin $n$ coins, in which the probability, $p$, of success (say 'heads') for each is 1/2, or throw $n$ dice, in which the probability, $p$, of success (say 'to get a 3') for each is 1/6, or have any $n$ similar but independent events happening in which the probability of success for each is $p$, the different resulting possibilities as to success are given by the successive terms in the expansion of $(s+f)^n$, and their corresponding probabilities are given by the successive terms in the expansion of $(p+q)^n$.

We are thus in a position to form a frequency table, like that on p. 53, showing the probabilities of getting 0, 1, 2 . . .
$n$ successes (in other words, the proportional frequencies of these different numbers of successes) at the occurrence of $n$ similar independent events, where $p$ is the probability of success for each and $q$ is the probability of failure:

Table (35). Binomial Distribution.

(1)            (2)                                      (3)                                         (4)
Number of      Frequency.                               Product of Nos. in                          Product of Nos. in
Successes.                                              Cols. (1) & (2).                            Cols. (1) & (3).
(x)            (f)                                      (fx)                                        (fx^2)
0              q^n                                      0                                           0
1              n q^(n-1) p                              n q^(n-1) p                                 n q^(n-1) p
2              [n(n-1)/1.2] q^(n-2) p^2                 2 [n(n-1)/1.2] q^(n-2) p^2                  2^2 [n(n-1)/1.2] q^(n-2) p^2
3              [n(n-1)(n-2)/1.2.3] q^(n-3) p^3          3 [n(n-1)(n-2)/1.2.3] q^(n-3) p^3           3^2 [n(n-1)(n-2)/1.2.3] q^(n-3) p^3
. . .          . . .                                    . . .                                       . . .
n              p^n                                      n p^n                                       n^2 p^n
Totals         1                                        np                                          np[1+p(n-1)]

Col. (1) gives the deviations from the origin of measurement, which in this case is taken as 'no successes,' the class interval being equal to a difference of 1 in the number of successes.

The summations of the last three columns are effected as follows:

Col. (2). $q^n + nq^{n-1}p + \frac{n(n-1)}{1 \cdot 2}q^{n-2}p^2 + \ldots + p^n = (q+p)^n = 1$, because $p + q = 1$.

Col. (3). $nq^{n-1}p + n(n-1)q^{n-2}p^2 + \ldots + np^n = np(q+p)^{n-1} = np$.

Col. (4). $np\left[q^{n-1} + 2(n-1)q^{n-2}p + \ldots + np^{n-1}\right] = np\left[(q+p)^{n-1} + (n-1)p(q+p)^{n-2}\right] = np[1 + p(n-1)]$.

The arithmetic mean of the distribution

= sum of terms in col. (3)/sum of terms in col. (2) $= np$.

The mean-square deviation referred to zero as origin, zero in this case corresponding to 'no successes'

= sum of terms in col. (4)/sum of terms in col. (2) $= np[1 + p(n-1)]$.

Thus the standard deviation, $\sigma$, is given by

$\sigma^2 = np[1 + p(n-1)] - \bar{x}^2$,

where $\bar{x}$ is the deviation of the mean from the origin of measurement, so that $\bar{x} = np$. Therefore

$\sigma^2 = np[1 + p(n-1)] - n^2p^2 = np(1-p) + n^2p^2 - n^2p^2 = npq$.

Hence $\sigma = \sqrt{npq}$, and p.e. $= 0.6745\sqrt{npq}$.

These two results are exceedingly important, and it is essential to understand what it is they measure. An example may help to make this clear.

If we spin 300 coins, counting 'head' for each a success, the number of heads we shall get will be unlikely to differ very greatly from the average or mean number of successes, $np$, i.e. 150 if $p = 1/2$ for each coin, and in the long run, if we repeat the experiment a great number of times, we shall get a proportion of about 150 heads to every one experiment.
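
The results mean $= np$ and $\sigma^2 = npq$ can be confirmed by building the frequency column of the binomial table and forming the first and second moments about zero, exactly as in the summations above. This is a sketch under assumed names (`binomial_table`, `mean_and_variance`); exact fractions are used so that the agreement is not blurred by rounding.

```python
from fractions import Fraction
from math import comb

def binomial_table(n, p):
    """Frequencies (probabilities) of 0, 1, ..., n successes."""
    q = 1 - p
    return [comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]

def mean_and_variance(freqs):
    """Mean and squared S.D., with deviations measured from zero."""
    total = sum(freqs)
    mean = sum(x * f for x, f in enumerate(freqs)) / total
    mean_square = sum(x * x * f for x, f in enumerate(freqs)) / total
    return mean, mean_square - mean ** 2

n, p = 10, Fraction(1, 6)          # ten dice, success = throwing a 3
freqs = binomial_table(n, p)
mean, variance = mean_and_variance(freqs)
print(sum(freqs))                  # 1
print(mean, n * p)                 # 5/3 5/3
print(variance, n * p * (1 - p))   # 25/18 25/18
```

The values n = 10 and p = 1/6 are chosen only for illustration; any n and p may be substituted.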
Again, if we throw 300 dice, counting every throw of the number 5, say, for each die a success, so that $p$ in this case $= 1/6$, the number of fives we shall get will be unlikely to differ much from $np$, i.e. 50, and in the long run, if we repeat the experiment a great number of times, we shall get on the average a proportion of about 50 fives to every experiment; we should find, for example, something like 5000 fives if we threw 300 dice 100 times in succession.

The arithmetic mean of the distribution tells us therefore about what number of successes to expect in one experiment with $n$ events if $n$ is fairly large, though we should be unlikely to get exactly this number if we confined ourselves to the one experiment.

The second result, the S.D., supplies us with a measure of the unlikelihood of getting the exact number of successes expected at any single experiment, for it defines the dispersion of the different numbers of possible successes about their average. Clearly the greater the dispersion, the greater is the likelihood of missing the average. The mean number of successes when an experiment is repeated a great number of times is $np$, but at any single experiment it is not unlikely that the number of successes obtained may differ from $np$ by as much as $0.6745\sqrt{npq}$ in excess or in defect; it is, however, unlikely, as we shall see later (p. 244), that the number will differ from $np$ by more than $3\sqrt{npq}$ in excess or defect when the distribution is not very skew, or unsymmetrical, especially if $n$ be large.

The probable error in the case above when we throw a sample of 300 dice is

$= 0.6745\sqrt{300 \times 1/6 \times 5/6} = 0.6745\sqrt{41.67} = 4.4$,

and it is therefore quite likely that the number of fives obtained at one experiment will differ from the expected number, 50, by as much as 4 or 5 in excess or defect, but it is unlikely that the number will fall outside the limits $50 \pm 3\sqrt{41.67}$, say 30 to 70.
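
The arithmetic of the 300-dice case may be reproduced as follows. This is a short sketch; the function names are assumptions introduced for illustration only.

```python
from math import sqrt

def standard_deviation(n, p):
    """S.D. of the number of successes in n independent events."""
    return sqrt(n * p * (1 - p))

def probable_error(n, p):
    """Probable error, taken as 0.6745 times the S.D."""
    return 0.6745 * standard_deviation(n, p)

n, p = 300, 1 / 6
expected = n * p
sd = standard_deviation(n, p)
print(round(expected, 2))                 # 50.0
print(round(probable_error(n, p), 1))     # 4.4
print(round(expected - 3 * sd, 1), round(expected + 3 * sd, 1))  # 30.6 69.4
```

The limits printed in the last line are those which the text rounds to 'say 30 to 70.'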
It is sometimes more convenient to refer to the proportion of successes, etc., expected at any experiment rather than to the actual number expected. In that case, since with $n$ events the expected number of successes is $pn$, but the number obtained may quite likely differ from this by $\pm 0.6745\sqrt{npq}$, therefore with $n$ events the expected proportion of successes is $pn/n$, i.e. $p$, with quite possibly an error $= \pm 0.6745\sqrt{npq}/n$, i.e. $\pm 0.6745\sqrt{pq/n}$.

Thus, with the 300 dice, the expected proportion of successes at one experiment lies between

$[1/6 - 0.6745\sqrt{1/6 \times 5/6 \div 300}]$ and $[1/6 + 0.6745\sqrt{1/6 \times 5/6 \div 300}]$,

i.e. $(1/6 - 0.6745/46.5)$ and $(1/6 + 0.6745/46.5)$,

i.e. 1/6.6 and 1/5.5;

and it is unlikely that the proportion will differ from 1/6 by more than 3/46.5, i.e. 1/15.5.

To illustrate how the binomial distribution might be directly applied, an experiment was made with 900 digits selected at random by taking in succession the digits in the seventh decimal place in the logarithms of the following numbers:

10054, 10154, 10254, . . . 99954,

as given in Chambers's Mathematical Tables. In this way each of the 10 digits, 0, 1, 2, 3 . . . 9, may be supposed to have stood an equal chance of selection each time one was written down. Gaps of 100 were left between the numbers selected so as to avoid runs of the same figure which sometimes occur even in the seventh decimal place owing to lack of independence.

The digits were arranged in 36 columns, each column containing 25 digits, and in this way we obtained what was equivalent to 36 separate but like experiments with 25 events each.

If we agree to regard the appearance of a 7 or an 8 as a successful event, and the appearance of any other digit as a failure, the chance of success at any appearance is 2/10, and the chance of failure is 8/10.
The case is thus of exactly the same kind as that of throwing 25 dice 36 times in succession, and if the probability of success, namely 1/5, for each independent event, be denoted by $p$, and the probability of failure, namely 4/5, by $q$, the distribution of successes and failures should approximately conform to that given by the expansion of

$(p+q)^{25}$

for any particular experiment, and since the experiment was repeated 36 times, the total numbers of successes and failures of different orders obtained should approximately conform to

$36(p+q)^{25}$,

for if the probability of an event is $p$ the number of events to be expected in N trials is N$p$.

The actual distribution observed is compared with that given by the binomial expansion in Table (36). Col. (2) is obtained by picking out the appropriate terms in the expansion of $36(p+q)^{25}$, where $p = 1/5$, $q = 4/5$; this expansion is

$36\left(p^{25} + 25p^{24}q + \frac{25 \cdot 24}{1 \cdot 2}p^{23}q^2 + \ldots + q^{25}\right)$.

Thus, 5 successes occur

$36 \cdot \frac{25 \cdot 24 \cdots 6}{1 \cdot 2 \cdot 3 \cdots 20}\,p^5q^{20}$

times, and this equals 7.06, or approximately 7.

The mean number of successes by theory $= np = 25/5 = 5$.

The mean by trial, since it is measured from zero as origin, the numbers in col. (1) being the deviations,

$= \Sigma(fx)/\Sigma(f) = 162/36 = 4.5$.

The standard deviation by theory $= \sqrt{npq} = \sqrt{25 \times \tfrac{1}{5} \times \tfrac{4}{5}} = 2$.

Table (36). Distribution of Successes (getting a 7 or 8) in the Random Choice of 25 digits 36 times in succession.

(1)            (2)               (3)               (4)                 (5)
No. of         Frequency by      Frequency by      Product of Nos.     Product of Nos.
Successes.     Calculation.      Experiment.       in Cols. (1) & (3). in Cols. (1) & (4).
(x)                              (f)               (fx)                (fx^2)
1              1                 1                   1                   1
2              3                 5                  10                  20
3              5                 5                  15                  45
4              7                 7                  28                 112
5              7                 9                  45                 225
6              6                 4                  24                 144
7              4                 3                  21                 147
8              2                 0                   0                   0
9              1                 2                  18                 162
Totals         36                36                162                 856

By trial, the mean square deviation, measured from zero as origin $= 856/36$. Thus the S.D. by trial $= \sqrt{856/36 - \bar{x}^2}$, where $\bar{x}$ is the deviation of the mean from the origin,

$= \sqrt{856/36 - (4.5)^2} = 1.88$.
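
The figures of Table (36) can be recomputed from the observed frequencies. The sketch below takes the frequencies from the table; the variable and function names are illustrative assumptions.

```python
from math import comb, sqrt

# Columns (1) and (3) of Table (36).
successes = [1, 2, 3, 4, 5, 6, 7, 8, 9]
observed = [1, 5, 5, 7, 9, 4, 3, 0, 2]

total = sum(observed)                                          # 36 experiments
first_moment = sum(x * f for x, f in zip(successes, observed))       # 162
second_moment = sum(x * x * f for x, f in zip(successes, observed))  # 856

mean_by_trial = first_moment / total
sd_by_trial = sqrt(second_moment / total - mean_by_trial ** 2)

def expected_frequency(x, n=25, p=0.2, repeats=36):
    """Term of repeats * (p + q)**n giving exactly x successes."""
    return repeats * comb(n, x) * p ** x * (1 - p) ** (n - x)

print(mean_by_trial)                        # 4.5
print(round(sd_by_trial, 2))                # 1.88
print(round(expected_frequency(5), 2))      # 7.06
```

Calling `expected_frequency` for the other values of x gives the unrounded forms of the remaining entries in col. (2).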
It will be seen that not one of the 36 experiments gave a number of successes differing from 5, the theoretical mean, by more than twice the S.D., for the number ranges only between 1 and 9.

If we treat the 900 digits as 900 separate experiments with one event each, instead of treating them as 36 experiments containing 25 events each, we have 1/10 as the chance for the appearance of any particular digit, and hence the number of times any digit may be expected to appear

$= np \pm \tfrac{2}{3}\sqrt{npq}$, approximately
$= (900)\tfrac{1}{10} \pm \tfrac{2}{3}\sqrt{900 \times \tfrac{1}{10} \times \tfrac{9}{10}}$
$= 90 \pm 6$.

The actual number of occurrences of each digit was as follows:

Digit                  0    1    2    3    4    5    6    7    8    9
No. of Occurrences    95   96   93  105   91   80   82   72   90   96

so that the digit 7 showed the greatest divergence from 90 of any, and this was only just three times the probable error.

[The Theory of Probability is older than that of Statistics. Todhunter, in his History, states that 'writers on the subject have shown a justifiable pride in connecting its true origin with the great name of Pascal.' The well-known story of the latter being found, as a lad of twelve, tracing out on the hall floor geometrical propositions which he had evolved in his own head is not to be wondered at, nor yet that at sixteen he wrote a small work on Conic Sections, when one reflects upon the fame he was to win as a philosopher and writer, as well as a mathematician, in his too brief life of thirty-nine years. He was born in 1623 of a distinguished French family, and for the last half of his life he suffered from the effects of a serious disease which contributed to turn his attention from mathematics to religion and philosophy. We learn from Todhunter how a certain gentleman of repute at the gaming tables set Pascal pondering on a question of probability concerning the fair division of stakes between two players who give up their game before its conclusion, an old problem cited in a work by Luca Pacioli as early as 1494.
A correspondence followed between him and Fermat, then probably the two most distinguished mathematicians in Europe, and so began a science which has fascinated at one time or another all great mathematicians from that day to this. The illustrious family of the Bernoullis, friends of Leibnitz, who championed his claim against that made by English mathematicians on behalf of Newton to the invention of the Calculus; De Moivre, an exile in England, owing to the revocation of the Edict of Nantes; Euler, Lagrange, and Laplace, who worked out in algebraical form Newton's theory of gravitation for the motion of the planets: all these had a share in building up the science of Probability, often by investigating problems in games of chance, where the conditions can be made mathematically perfect, so by careful analysis preparing the way for the use later of the same principles in matters of greater importance.

It has been said that the development of the subject owes more to Laplace (1749-1827) than to any other mathematician; nor did he confine himself to its theory: he would have earned fame by his astronomical applications alone. His method was to take certain observations, and to determine by means of probability whether the abnormalities present were merely the results of chance or whether there was some as yet undiscovered but constantly acting cause behind the phenomena observed. In this way he was led to highly interesting and important results such as those relating to the theory of the tides, the effect of the spheroidal shape of the earth on the motion of the moon, the irregularities of Jupiter and Saturn, and the laws which govern the motion of Jupiter's moons. It needs but a step in thought to pass from the discussion of such physical data to the statistics of social phenomena and the causes which determine abnormalities met with in that field.
Professor Edgeworth, in making reference to books that have been written on Probability at the end of his excellent article under that heading in the Encyclopædia Britannica, remarks that 'as a comprehensive and masterly treatment of the subject as a whole, in its philosophical as well as mathematical character, there is nothing similar or second to Laplace's Théorie analytique des probabilités.']

CHAPTER XIII

SAMPLING (continued): FORMULAE FOR PROBABLE ERRORS

So far we have only considered the most simple case of random sampling, when we take a sample of $n$ independent events each of which falls into one of two classes according to its nature, the chance of entering either class being the same for every event: we have dealt, that is to say, more particularly with non-measurable characters. We pass on now to measurable characters which are distributed among several classes according to their size, so that a frequency distribution table can be set up for each sample; and assuming that the population from which the samples are drawn is homogeneous, the samples themselves containing each an adequate number of individuals, there should not be greater differences between one table and another than can be accounted for by random sampling. It is our object to discover how great such differences may be.

Given a homogeneous population of $N$ individuals which we will suppose could be distributed into a number of groups, $Y_1$ individuals in the first group, $Y_2$ in the second group, $Y_3$ in the third, and so on, according to the size of the organ or character under observation. Suppose a random sample of $n$ individuals be taken from this population, and when they are assigned to their several groups let the frequency table now take the form shown, with $y_1$ individuals in the first group, $y_2$ in the second, and so on.

GENERAL POPULATION.

| Class. | Frequency. |
|---|---|
| 1st Group | $Y_1$ |
| 2nd Group | $Y_2$ |
| . . . | . . . |
| $k$th Group | $Y_k$ |
| . . . | . . . |
| Total | $N$ |

SAMPLE.

| Class. | Frequency. |
|---|---|
| 1st Group | $y_1$ |
| 2nd Group | $y_2$ |
| . . . | . . . |
| $k$th Group | $y_k$ |
| . . . | . . . |
| Total | $n$ |

To find the probable error of $y_k$, the frequency observed in the $k$th group.

Consider the selection of the $n$ individuals, one by one in succession, to form the sample. When the first choice is made the probability that we shall get an individual falling into the $k$th group is, by definition, $Y_k/N$, and the probability will remain practically the same for each successive choice granted that $N$ is considerable. We have thus $n$ independent events, the chance of success (falling into the $k$th group) for each being $p\ (= Y_k/N)$ and the chance of failure being $q\ (= 1 - Y_k/N)$. The case is therefore analogous to the one previously considered to which the binomial distribution is applicable, so that the frequency to be expected in the $k$th group is $np$ with S.D., $\sigma_{y_k} = \sqrt{npq}$; i.e. $y_k = np$ with a p.e. $= 0.6745\sqrt{npq}$.

Now in practice the numbers $Y_1, Y_2, Y_3, \ldots$ would not be known, and hence the true value of $p$ would also be unknown, but since $y_k = np$, approximately, when the sample is of adequate size, we shall get a fair idea of the probable error involved by taking $p = y_k/n$, where $y_k$ is the actual frequency observed in the $k$th group. Hence,

$$\sigma_{y_k}^2 = npq = y_k(1 - p) = y_k\left(1 - \frac{y_k}{n}\right) \qquad (1)$$

and the frequency in the $k$th group

$$= y_k \pm 0.6745\sqrt{y_k\left(1 - \frac{y_k}{n}\right)} \qquad (2)$$

The size of the S.D. is under ordinary conditions a test of the adequacy of the sample, for the frequency in the $k$th group, if due simply to random sampling, should not differ from its expected value by more than $3\sigma_{y_k}$, and $\sigma_{y_k}$ should therefore be small compared with $y_k$ itself.

To find the correlation between the frequencies in any two groups of a sample distribution.

Let the expected frequencies in the various groups of the sample be denoted by $y_1, y_2, \ldots, y_k, \ldots$, and suppose an error $\delta y_k$ in $y_k$ is associated with errors $\delta y_1, \delta y_2, \ldots, \delta y_s, \ldots$ in $y_1, y_2, \ldots, y_s, \ldots$ We require then the correlation between $y_k$ and $y_s$.

| Class. | Expected Frequency. | Observed Frequency. |
|---|---|---|
| 1st Group | $y_1$ | $y_1 + \delta y_1$ |
| 2nd Group | $y_2$ | $y_2 + \delta y_2$ |
| . . . | . . . | . . . |
| $k$th Group | $y_k$ | $y_k + \delta y_k$ |
| . . . | . . . | . . . |
| $s$th Group | $y_s$ | $y_s + \delta y_s$ |
| . . . | . . . | . . . |
| Total | $n$ | $n$ |

Now although the group frequencies may change relative to one another, the total sum of frequencies in all groups is not affected, because the $n$ individuals of the sample make up its composition in each case: to keep $n$ constant the group frequencies must adjust themselves accordingly, which explains the correlation between them. Hence to compensate for an excess, $\delta y_k$ (assuming $\delta y_k$ +ve), of frequency in any one group there must be a defect $(-\delta y_k)$ shared among the other groups, and the fairest way of sharing will be in proportion to the expected frequencies in the several groups. But the total frequency divided between groups other than the $k$th is $(n - y_k)$, so that the proportion of $(-\delta y_k)$ due to the $s$th group is $y_s/(n - y_k)$, thus

$$\delta y_s = \frac{y_s}{n - y_k}(-\delta y_k).$$

Therefore,

$$\delta y_k \cdot \delta y_s = -\frac{y_s/n}{1 - y_k/n}\,\delta y_k^2 = -\frac{y_k y_s}{n}\cdot\frac{\delta y_k^2}{\sigma_{y_k}^2}, \text{ by (1).} \qquad (3)$$

This gives the product moment of the deviations from $y_k$ and $y_s$ in one particular sample; summing for all such samples, remembering that by definition the coefficient of correlation between $y_k$ and $y_s$ is $r_{y_k y_s} = \Sigma(\delta y_k \cdot \delta y_s)/\nu\sigma_{y_k}\sigma_{y_s}$, where $\nu$ is the total number of samples, also $\sigma_{y_k}^2 = \Sigma\,\delta y_k^2/\nu$, we have

$$\nu\, r_{y_k y_s}\,\sigma_{y_k}\sigma_{y_s} = -\frac{y_k y_s}{n}\cdot\nu.$$

Therefore,

$$r_{y_k y_s} = -\frac{1}{n}\cdot\frac{y_k y_s}{\sigma_{y_k}\sigma_{y_s}} \qquad (4)$$

gives the correlation required.

To find the p.e. of the mean of a sample of $n$ observations.

Let a frequency table be drawn up in the usual manner showing the number of observations $y_1, y_2, \ldots$ corresponding to organs of different sizes $x_1, x_2, \ldots$

FIRST SAMPLE.

| Size of Organ or Character observed. | Frequency of Observations. | First Moment. | Second Moment. |
|---|---|---|---|
| $x_1$ | $y_1$ | $x_1 y_1$ | $x_1^2 y_1$ |
| $x_2$ | $y_2$ | $x_2 y_2$ | $x_2^2 y_2$ |
| . . . | . . . | . . . | . . . |
| $x_k$ | $y_k$ | $x_k y_k$ | $x_k^2 y_k$ |
| . . . | . . . | . . . | . . . |
| Total | $n$ | $\Sigma(xy)$ | $\Sigma(x^2 y)$ |

The mean referred to some fixed point as origin is then given by

$$M = \Sigma(xy)/n,$$

also the mean square deviation of the sample referred to the same fixed point is $\mu'_2$, say, given by

$$\mu'_2 = \Sigma(x^2 y)/n,$$

and $\mu'_2 - M^2 = \sigma^2$, where $\sigma$ is the S.D. of the sample.
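Results (2) and (4) above are easily put to numerical use. The following is a minimal sketch in Python, not part of the original text: the function names and the group frequencies used in the illustration are hypothetical, chosen only to show the arithmetic.

```python
import math

PE_FACTOR = 0.6745  # probable error = 0.6745 x standard deviation


def group_frequency_sd(y_k: float, n: int) -> float:
    """S.D. of the frequency in one group, result (1): sqrt(y_k * (1 - y_k/n))."""
    return math.sqrt(y_k * (1.0 - y_k / n))


def group_frequency_pe(y_k: float, n: int) -> float:
    """Probable error of the frequency in one group, result (2)."""
    return PE_FACTOR * group_frequency_sd(y_k, n)


def group_correlation(y_k: float, y_s: float, n: int) -> float:
    """Correlation between two group frequencies, result (4):
    r = -(1/n) * y_k * y_s / (sd_k * sd_s)."""
    return -(y_k * y_s / n) / (
        group_frequency_sd(y_k, n) * group_frequency_sd(y_s, n)
    )


if __name__ == "__main__":
    # Hypothetical sample of n = 400 individuals spread over four groups.
    n = 400
    y_k, y_s = 100, 160
    print("S.D. of y_k :", round(group_frequency_sd(y_k, n), 3))
    print("p.e. of y_k :", round(group_frequency_pe(y_k, n), 3))
    print("r(y_k, y_s) :", round(group_correlation(y_k, y_s, n), 4))
```

The correlation is necessarily negative, since an excess in one group can only be made up by a defect in the others.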
For another sample of the same size the frequency distribution may be slightly different, say, $y_1 + \delta y_1$, $y_2 + \delta y_2$, . . ., and consequently the mean will also be different, say,

$$M + \delta M = [x_1(y_1 + \delta y_1) + x_2(y_2 + \delta y_2) + \ldots]/n.$$

SECOND SAMPLE.

| Size of Organ or Character observed. | Frequency of Observations. | First Moment. |
|---|---|---|
| $x_1$ | $y_1 + \delta y_1$ | $x_1(y_1 + \delta y_1)$ |
| $x_2$ | $y_2 + \delta y_2$ | $x_2(y_2 + \delta y_2)$ |
| . . . | . . . | . . . |
| $x_k$ | $y_k + \delta y_k$ | $x_k(y_k + \delta y_k)$ |
| . . . | . . . | . . . |
| Total | $n$ | $\Sigma x(y + \delta y)$ |

And, by subtraction,

$$\delta M = (x_1\,\delta y_1 + x_2\,\delta y_2 + \ldots)/n \qquad (5)$$

Now we want to determine the S.D. of the different values of $M$ found among the different samples, and that is given by

$$\sigma_M^2 = \Sigma(\delta M^2)/\nu,$$

where $\Sigma$ denotes summation for all samples and $\nu$ is the number of samples. This suggests that we should square both sides of equation (5), getting

$$n^2\,\delta M^2 = x_1^2\,\delta y_1^2 + \ldots + 2x_1 x_2\,\delta y_1\,\delta y_2 + \ldots$$

Therefore,

$$n^2 \cdot \nu\sigma_M^2 = x_1^2\,\nu\sigma_{y_1}^2 + \ldots + 2x_1 x_2\left(-\frac{y_1 y_2}{n}\,\nu\right) + \ldots, \text{ by (3).}$$

Hence, making use also of (1),

$$n^2\sigma_M^2 = x_1^2 y_1\left(1 - \frac{y_1}{n}\right) + \ldots - \frac{2\,x_1 y_1 \cdot x_2 y_2}{n} - \ldots$$
$$= (x_1^2 y_1 + \ldots) - \frac{1}{n}(x_1^2 y_1^2 + \ldots + 2\,x_1 y_1 \cdot x_2 y_2 + \ldots)$$
$$= n\mu'_2 - \frac{1}{n}(x_1 y_1 + \ldots)^2$$
$$= n(\mu'_2 - M^2).*$$

Thus $\sigma_M^2 = (\mu'_2 - M^2)/n = \sigma^2/n$, and the probable error of the mean

$$= 0.6745\,\sigma/\sqrt{n} \qquad (6)$$

The p.e. in the arithmetic mean found by taking a random sample of $n$ events is a measure, so to speak, of the failure to hit the absolute mean, and it follows that the precision of the sample, the accuracy of aim at the mean, would be not unfairly measured by some quantity proportional to the reciprocal of the above expression, namely, $\sqrt{n}/0.6745\,\sigma$. With such a measure the precision would evidently be increased if the number of observations in the sample were increased, being proportional to the square root of their number.

[* We do not know the true mean for the population as a whole, but we take in place of it $M$, the value given by the sample, which we may do with little error if $n$ is large. Similarly $\sigma$ is the S.D. of the sample.]

[It is desirable to draw a distinction here between what have been termed biassed errors and unbiassed errors; errors due to random sampling are of the second class for there is, by hypothesis, no reason why they should be in one direction rather than in another. Biassed errors, however, all tend to be in the same direction and they may arise in different ways, e.g. they may be due to faults of omission or commission on the part of the observer himself: he observes either carelessly or badly, omitting certain factors which ought to be taken into account, or so measuring or classifying his results that they appear always larger or less than they really are in fact.

Sometimes, although the bias is known to exist, it may be impossible to correct it: the most one can do is to bear it in mind and allow for it in using the results. A familiar example of this occurs in the collection of household budgets from the poor to find their standard of living, where it is only possible to get particulars from the more intelligent and thrifty class among them.

Whereas in the case of unbiassed errors due to random sampling we can diminish the probable error of the average by increasing the number of observations, the same is not true of errors which are biassed, for suppose an error $e$ in excess be made in each of $n$ observations $x_1, x_2, \ldots, x_n$, the effect upon the average is to increase it from

$$\frac{x_1 + x_2 + \ldots + x_n}{n} \quad\text{to}\quad \frac{(x_1 + e) + (x_2 + e) + \ldots + (x_n + e)}{n},$$

i.e. from

$$\frac{x_1 + x_2 + \ldots + x_n}{n} \quad\text{to}\quad \frac{x_1 + x_2 + \ldots + x_n}{n} + e,$$

so that the average is over-estimated by precisely the same amount.

If, therefore, we know that bias exists, it is well, if possible, to correct it in each observation, for by so doing we change biassed into unbiassed errors, and though our corrections may be somewhat wide of the mark, the resultant error will then be diminished by increasing the number of observations: e.g. a farmer offers 400 sheep for sale and, being anxious to make a good bargain, he asks a higher figure for them than he is in reality prepared to take; let us suppose that this excess is 2s. 6d.
for each sheep, then clearly the average price per sheep at which he is prepared to sell will be less than the amount he asks by 2s. 6d. also. But now suppose the buyer, a simple person knowing little of the prices of sheep and less of the ways of men, goes through the flock one by one and makes the error of offering a price either much above or much below what the seller is prepared to take; even if his unbiassed offers differ by as much as 10s. for each sheep from the seller's reserve price, so long as they are random in direction, i.e. sometimes too much and sometimes too little, the resultant difference in the average from what the seller is prepared to take will probably not greatly exceed $\tfrac{2}{3} \cdot 10\text{s.}/\sqrt{400}$, or 4d. per sheep.

We can sometimes diminish the effect of bias, even when its extent is unknown, by working with the ratios of the quantities affected instead of with the quantities themselves: e.g. suppose biassed errors, $e_1$ and $e_2$, enter into the measurement of the variables $x_1$ and $x_2$, both in excess, the ratio of the variables then

$$= (x_1 + e_1)/(x_2 + e_2)$$
$$= \frac{x_1}{x_2}\left(1 + \frac{e_1}{x_1}\right)\left(1 + \frac{e_2}{x_2}\right)^{-1}$$
$$= \frac{x_1}{x_2}\left(1 + \frac{e_1}{x_1}\right)\left(1 - \frac{e_2}{x_2} + \text{higher powers of } e_2\right)$$
$$= \frac{x_1}{x_2}\left(1 + \frac{e_1}{x_1} - \frac{e_2}{x_2}\right),$$

if we omit higher powers of $e_1$ and $e_2$ than the first on the understanding that they are both comparatively small. Suppose, for example, there was an error of 5 per cent. made in measuring $x_1$ and an error of 3 per cent. of like sign in measuring $x_2$, then the resulting error in $x_1/x_2$ would be 5 per cent. − 3 per cent. = 2 per cent. Clearly the same holds good also if the errors are both in defect.

This explains why a comparison of results arranged, say, on the index number principle may be trustworthy, although the method of formation of the numbers themselves may be in some respects faulty, granted that the same faults are repeated each year so as to produce like errors, i.e. the bias is to be unchanged in character.
To correct the faults in one case and not in the other would prejudice the success of the method, since it depends upon the errors counteracting one another.]

Example (1). To illustrate the important result we have obtained for the p.e. of the mean of $n$ observations let us return to the experiment of selecting 900 random digits. The distribution actually obtained, and the theoretical distribution to be expected in the long run if the experiment were repeated several hundred times and the average taken, are shown in the following table:

Table (37). Distribution of 900 Random Digits.

| Digit. | Frequency Observed. | Theoretical Frequency. |
|---|---|---|
| 0 | 95 | 90 |
| 1 | 96 | 90 |
| 2 | 93 | 90 |
| 3 | 105 | 90 |
| 4 | 91 | 90 |
| 5 | 80 | 90 |
| 6 | 82 | 90 |
| 7 | 72 | 90 |
| 8 | 90 | 90 |
| 9 | 96 | 90 |

It is a simple matter to calculate the mean and S.D. for the distribution from this table in the usual way; the results are:

Observed mean = 4.38; S.D. = 2.911
Theoretical mean = 4.50; S.D. = 2.872.

Thus the p.e. of the mean based on the sample

$$= \pm 0.6745 \times 2.911/\sqrt{900} = \pm 0.065,$$

and 4.38 differs from 4.50 by less than three times the p.e.

The 36 averages of samples of 25 events apiece were also calculated, and the following were the results obtained:

2.76, 3.32, 3.68, 3.72, 3.72, 3.72, 3.76, 3.80, 3.92, 3.92, 4.08, 4.12, 4.16, 4.16, 4.16, 4.28, 4.36, 4.40, 4.40, 4.40, 4.44, 4.60, 4.64, 4.68, 4.72, 4.72, 4.76, 4.88, 4.96, 5.00, 5.00, 5.00, 5.08, 5.28, 5.40, 5.72.

The mean of this distribution = 157.72/36 = 4.381, and the S.D. = 0.612. But the S.D. of the whole distribution of 900 digits = 2.911, and therefore the S.D. of the distribution of averages of samples of 25 digits should be $2.911/\sqrt{25} = 0.582$, differing from 0.612 by about 5 per cent.

To find the p.e. of the sum or difference of two variables.

Let the mean values of the two variables be denoted by $y$ and $z$, so that deviations from these values found in a particular sample may be denoted by $\delta y$ and $\delta z$.
If then we write $u = y + z$ we have

$$\delta u = \delta y + \delta z \qquad (7)$$

To find the S.D. of $u$ we therefore require $\Sigma(\delta u^2)/\nu$, where the $\Sigma$ denotes summation for all samples and $\nu$ is the number of samples. But, squaring both sides of equation (7), we have

$$\delta u^2 = \delta y^2 + \delta z^2 + 2\,\delta y\,\delta z.$$

Thus

$$\Sigma\,\delta u^2 = \Sigma\,\delta y^2 + \Sigma\,\delta z^2 + 2\Sigma(\delta y\,\delta z),$$

where the summation extends to all samples. Hence

$$\nu\sigma_u^2 = \nu\sigma_y^2 + \nu\sigma_z^2 + 2\nu\,\sigma_y\sigma_z r_{yz},$$

or

$$\sigma_u^2 = \sigma_y^2 + \sigma_z^2 + 2r_{yz}\sigma_y\sigma_z,$$

where $r_{yz}$ is the correlation between the variables. And the p.e. $= 0.6745\,\sigma_u$.

The p.e. of the difference of two variables follows at once by changing the sign of $z$ throughout; for, if $v = y - z$, we have

$$\delta v^2 = \delta y^2 + \delta z^2 - 2\,\delta y\,\delta z,$$

and

$$\sigma_v^2 = \sigma_y^2 + \sigma_z^2 - 2r_{yz}\sigma_y\sigma_z.$$

Generally, if $x_1, x_2, \ldots, x_n$ be the mean values of $n$ variables, and if $\delta x_1, \delta x_2, \ldots, \delta x_n$ denote deviations from these values in a particular sample, we may write

$$u = x_1 + x_2 + \ldots + x_n \quad\text{and}\quad \delta u = \delta x_1 + \delta x_2 + \ldots + \delta x_n.$$

Thus

$$\Sigma\,\delta u^2 = \Sigma\,\delta x_1^2 + \ldots + 2\Sigma(\delta x_1\,\delta x_2) + \ldots,$$

whence

$$\sigma_u^2 = \sigma_{x_1}^2 + \ldots + 2r_{x_1 x_2}\sigma_{x_1}\sigma_{x_2} + \ldots$$

Important Corollary. If $y$ and $z$ are quite independent so that $r_{yz}$ is zero, the p.e. of their sum and the p.e. of their difference have the same value, namely, the square root of the sum of the squares of the p.e.'s of $y$ and $z$ themselves, which

$$= 0.6745\sqrt{\sigma_y^2 + \sigma_z^2} \qquad (8)$$

This result is exceedingly important, because it can be directly used to test whether a difference between two samples is accidental, i.e. whether it is such as might arise through sampling, or whether it implies a real difference between the two populations from which the samples are selected. An example will illustrate the procedure:

Example (2). In a study of Minimum Rates in the Tailoring Industry, by R. H. Tawney, a table is given (p. 114) which suggests that 'in the north of England women work in the tailoring trade when they are young . . . in London and Colchester they have to work when they are older.' Taking some figures from that table we find:

| District. | Workers over 35 years old. | Workers at all ages. | Proportion over 35. |
|---|---|---|---|
| London and Essex | 11,718 | 35,316 | 0.332 |
| Manchester and Leeds | 4,029 | 21,822 | 0.185 |

The difference between the proportions over 35 years of age = (0.332 − 0.185) = 0.147.

Let us suppose for the moment that this difference is not significant of any real difference in conditions between the two districts, but is merely due to random sampling. In that case the most natural value to assign to the true proportion of women workers over 35 for the trade as a whole, as given by these figures, would be

$$\frac{11{,}718 + 4{,}029}{35{,}316 + 21{,}822} = \frac{15{,}747}{57{,}138} = 0.276.$$

The S.D. for the first sample (London and Essex) would then be

$$\sigma_1 = \sqrt{pq/n} = \sqrt{0.276 \times 0.724/35{,}316},$$

and for the second sample (Manchester and Leeds) would be

$$\sigma_2 = \sqrt{0.276 \times 0.724/21{,}822}.$$

Hence the p.e. for the difference between the proportions in the two samples would be roughly

$$= \tfrac{2}{3}\sqrt{\sigma_1^2 + \sigma_2^2}, \text{ by (8),}$$
$$= \tfrac{2}{3}\sqrt{0.276 \times 0.724\,(1/35{,}316 + 1/21{,}822)}$$
$$= \tfrac{2}{3}\sqrt{0.276 \times 0.724/13{,}500}$$
$$= 0.0026.$$

The actual difference between the proportions, 0.147, being much more than 3(0.0026), is certainly significant of a greater difference between the two populations than can be explained by random sampling alone.

Another method of attack would be to assume a real difference between the two populations, if other considerations led us to suspect such a difference, and to find whether such a difference could be disguised by random sampling. In that case the proper proportion to assume for the first sample would be 0.332, giving

$$\sigma_1 = \sqrt{0.332 \times 0.668/35{,}316} = \sqrt{628}/10^4,$$

and for the second sample the proportion would be 0.185, giving

$$\sigma_2 = \sqrt{0.185 \times 0.815/21{,}822} = \sqrt{691}/10^4.$$

Hence the p.e. for the difference between these two proportions due to random sampling would be

$$= \tfrac{2}{3}\sqrt{\sigma_1^2 + \sigma_2^2}, \text{ by (8),}$$
$$= \tfrac{2}{3} \cdot \frac{1}{10^4}\sqrt{628 + 691}$$
$$= 0.0024.$$

The actual difference is 0.147, which certainly could not be outbalanced by an error in the opposite direction due to random sampling, because it is much more than three times the probable error due to sampling.
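The two methods of attack in Example (2) may be checked with a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, and the factor 0.6745 is used in place of the rough value 2/3 adopted in the working above, so the results agree with the text only to the number of figures there given.

```python
import math

PE_FACTOR = 0.6745  # probable error = 0.6745 x standard deviation


def pe_difference_pooled(a1: int, n1: int, a2: int, n2: int) -> float:
    """p.e. of the difference of two proportions, supposing both samples
    to come from one population whose proportion is the pooled value."""
    p = (a1 + a2) / (n1 + n2)
    q = 1.0 - p
    return PE_FACTOR * math.sqrt(p * q * (1.0 / n1 + 1.0 / n2))


def pe_difference_separate(a1: int, n1: int, a2: int, n2: int) -> float:
    """p.e. of the difference, each sample keeping its own proportion."""
    p1, p2 = a1 / n1, a2 / n2
    return PE_FACTOR * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


if __name__ == "__main__":
    # Figures quoted in Example (2): workers over 35, and workers at all ages.
    london = (11718, 35316)
    north = (4029, 21822)
    difference = london[0] / london[1] - north[0] / north[1]
    pooled = pe_difference_pooled(*london, *north)
    separate = pe_difference_separate(*london, *north)
    print("difference of proportions:", round(difference, 3))
    print("p.e. (pooled proportion)  :", round(pooled, 4))
    print("p.e. (separate proportions):", round(separate, 4))
    print("difference / p.e.          :", round(difference / pooled, 1))
```

On either supposition the observed difference is many times its probable error, which is the conclusion reached in the text.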
Sometimes we have to test the difference, not between two simple proportions, but between two sample distributions. In that case the mean of each sample may be calculated so that the difference $(M_1 - M_2)$ between the means is known; to find out whether or not it is significant of some real difference between the two populations from which the samples are drawn, $(M_1 - M_2)$ is compared with its p.e., namely

$$0.6745\sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2}, \quad\text{or}\quad 0.6745\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2} \qquad (9)$$

where $n_1$ and $n_2$ are the numbers of observations in the two samples respectively, and $\sigma_1$, $\sigma_2$ are the S.D.'s of the samples.

Unless $(M_1 - M_2)$ is definitely greater than some two or three times this expression we cannot be very sure that the difference between $M_1$ and $M_2$ may not have arisen merely through random sampling, and it may quite likely not be significant * of any real difference between the two populations as regards the organ or character which is under consideration.

[* It should be observed that the S.D. provides a wider margin for significance than the p.e., because a range of approximately 3 p.e. $= 3 \times \tfrac{2}{3}\sigma = 2\sigma$ . . .

. . . been compared; but the mean alone will scarcely serve to establish the identity of any population. For example, we can conceive of two distinct races of men, both of the same mean height, but one race embracing a number of giants and dwarfs. Of course if we agreed to define two races as identical when they have the same mean heights, there would be nothing more to be said, but that would certainly only be a very rough-and-ready attempt at classification.

Taking into consideration only the character of height, a further step in definition would be to measure the mode or most fashionable height, and the dispersion or variability (absolute: the standard deviation, and relative: the coefficient of variation) of the two races.
Then, after comparing heights with sufficient detail, the attention could be turned to innumerable other characters, skull and body measurements, physical, mental, and even moral attributes. Clearly the difficulty of definition and of establishment of identity grows as we pass along the scale from physical to moral.

Moreover, other statistical constants must be requisitioned when the question of the existence and degree of relationship between two organs or characters is to be determined. As the S.D. and the C. of V. serve to measure the amount of variability, so the coefficient of correlation comes in to measure the amount of likeness or association. Further, and especially in problems of inheritance, the coefficient of regression must be measured. It might seem at first sight hopeless to try and measure the correlation between two such characters as athletic capacity and health in the same boy, or between the truthfulness of one boy and that of his brother; but the genius of Karl Pearson has gone some way to solve even this difficult problem by means of a system of adjectival instead of numerical classification [see Phil. Trans., vol. 195a, pp. 1-47, On the Correlation of Characters not Quantitatively Measurable, and, as an exceptionally interesting application of the method, see Pearson, On the Laws of Inheritance in Man, ii.; On the Inheritance of the Mental and Moral Characters in Man and its Comparison with the Inheritance of the Physical Characters; Biometrika, vol. iii., pp. 131-190].
In short, for a full and exact definition of a population of any kind, human or otherwise, it is necessary to measure not only the means, but all the more important statistical constants, modes, medians, S.D.'s, C.'s of V., coefficients of correlation and regression, and so on, and it is no less necessary to calculate also their probable errors if we are to test the real significance of such differences as are observed in these constants between two samples from the same or from different populations.

The probable errors for the more important constants, some of which are only introduced later in the book, are collected together in Table (38) for reference. The proofs in general are a little intricate and would be lacking in interest to the ordinary person, who is satisfied to take algebraical analysis on trust so long as he understands the nature of the results he uses, but the more mathematical reader who is anxious to see proofs may refer for some of them to Biometrika, vol. ii., pp. 273-281, Editorial, On the Probable Errors of Frequency Constants, which has been freely consulted on the subject here. The usual notation is adopted, $n$ being the total number of observations in the given distribution, supposed normal in general, $\sigma$ the S.D., etc.

Table (38). Probable Errors of Statistical Constants.

| Statistical Constant. | Probable Error (= 0.6745 S.D.). |
|---|---|
| Any observed group frequency, $y$ | $0.6745 \times \sqrt{y(1 - y/n)}$ |
| The mean of a distribution of any type | $0.6745 \times \sigma/\sqrt{n}$ |
| The standard deviation, $\sigma$ | $0.6745 \times \sigma/\sqrt{2n}$ |
| The coefficient of variation, $V$ | $0.6745 \times \dfrac{V}{\sqrt{2n}}\left[1 + 2\left(\dfrac{V}{100}\right)^2\right]^{1/2}$ |
| The coefficient of correlation, $r$ | $0.6745 \times (1 - r^2)/\sqrt{n}$ |
| The correlation ratio, $\eta$ | $0.6745 \times (1 - \eta^2)/\sqrt{n}$, nearly |
| $X$, as determined from $(X - \bar{X}) = r\dfrac{\sigma_x}{\sigma_y}(Y - \bar{Y})$, when $Y$ is given | $0.6745 \times \sigma_x\sqrt{1 - r^2}$ |
| $Y$, as determined from $(Y - \bar{Y}) = r\dfrac{\sigma_y}{\sigma_x}(X - \bar{X})$, when $X$ is given | $0.6745 \times \sigma_y\sqrt{1 - r^2}$ |
| Distance between mode and mean in a skew distribution | $0.6745 \times \sigma\sqrt{3/2n}$ |
| Skewness | $0.6745 \times \sqrt{3/2n}$ |
| $\beta_2$ (which should = 3 for a normal distribution) | $0.6745 \times \sqrt{24/n}$ |
| $\beta_1$ (which should = 0 for a normal distribution) | 0 |
| $\sqrt{\beta_1}$ (which should = 0 for a normal distribution) | $0.6745 \times \sqrt{6/n}$ |

Example (4).
In the example which follows are given data necessary for testing the significance of differences in variability as well as in mean values. They represent an attempt made to find whether members of a particular species of crab caught in shallow water differed with regard to certain characteristics from those caught in comparatively deep water [see Biometrika, vol. ii., pp. 191 et seq., Variation in Eupagurus Prideauxi, by E. H. J. Schuster]. Only a few of the results are recorded here, to two decimal places; the reader will find it a valuable exercise to verify for himself the p.e.'s given in each case.

| Measurement Made. | Sex. | Locality. | Mean (mm.). | S.D. (mm.). | C. of V. per cent. |
|---|---|---|---|---|---|
| Carapace length | Male | Deep water | 8.59 ± 0.05 | 1.67 ± 0.04 | 19.45 |
| | Male | Shallow water | 8.41 ± 0.04 | 1.49 ± 0.03 | |
| | Female | Deep water | 7.54 ± 0.03 | 0.94 ± 0.02 | 12.49 ± 0.28 |
| | Female | Shallow water | 7.12 ± 0.02 | 0.86 ± 0.02 | 12.12 ± 0.25 |

| Difference of Means (mm.). | Difference of S.D.'s (mm.). | Difference of C.'s of V. per cent. | Sex. |
|---|---|---|---|
| 0.18 ± 0.07 (poss. sig.) | 0.18 ± 0.05 (prob. sig.) | 1.70 ± 0.58 (poss. sig.) | Male |
| 0.42 ± 0.04 (sig.) | 0.08 ± 0.03 (poss. sig.) | 0.37 ± 0.37 (not sig.) | Female |

The significance or otherwise of differences between variabilities in the case of cuckoos' eggs (p. 161) might be tested in the same way.

CHAPTER XIV

FURTHER APPLICATIONS OF SAMPLING FORMULAE

We have been discussing in the last chapter how to test two samples, supposed each to contain homogeneous material, to find out whether they belong to the same or to different types of population, but the further question often arises as to whether a sample is or is not homogeneous.

Example (1). To this we may obtain a partial answer by working out the statistical constants of the sample and their p.e.'s in order to compare them with the corresponding constants for a sample or series of samples believed to be homogeneous and of the same type.
For example, Professor Karl Pearson has measured the skulls of skeletons of the Naqada race, excavated in Upper Egypt by Professor Flinders Petrie and presumed to be some 8000 years old, and he places his results for comparison alongside those for certain other races admittedly homogeneous [see Biometrika, vol. ii., p. 345, Homogeneity and Heterogeneity in Collections of Crania]:

| | Series. | Number of Observations. | Variability (mm.): Skull Length. | Variability (mm.): Skull Breadth. |
|---|---|---|---|---|
| Skulls | Ainos | 76 | 5.936 | 3.897 |
| | Bavarians | 100 | 6.088 | 5.849 |
| | Parisians | 77 | 5.942 | 5.214 |
| | Naqadas | 139 | 5.722 | 4.612 |
| | English | 136 | 6.085 | 4.976 |
| Living heads | Cambridge undergraduates | 1000 | 6.161 | 5.055 |
| | English criminals | 3000 | 6.046 | 5.014 |
| | Oraons of Chota Nagpur | 100 | 5.916 | 4.397 |
| | Mean Variability | | 5.987 | 4.877 |

The S.D. of the variability of skull length calculated from this series = 0.129 mm. and of the variability of skull breadth = 0.545 mm., and these supply standards for valuing the differences between the Naqada and the mean variabilities.

Another method of procedure is to take a random sample out of the sample itself, assuming the latter is large enough to admit of an adequate sub-sample, and to compare the constants of the whole and part. When they do not differ beyond the limits allowed by random sampling the inference is that the whole may be treated as a homogeneous class if judged by this test alone.

Example (2). In an interesting and important memoir, On Criminal Anthropometry and the Identification of Criminals, by W. R. Macdonell [Biometrika, vol. i., pp. 177 et seq.], the author uses this method to test the homogeneity of a class of 3000 criminals by measuring also a random sample of 1306 criminals out of the 3000. He obtained, for example,

S.D. of head length = 6.04593 ± 0.05265 mm., for the 3000 criminals;
S.D. of head length = 6.00247 ± 0.07922 mm., for the 1306 criminals.

The difference between the variabilities in the sample and sub-sample, by result (8) on p.
158,

$$= 0.04346 \pm \sqrt{(0.05265)^2 + (0.07922)^2} = 0.04346 \pm 0.09512,$$

which is certainly not significant. If the same holds good with regard to the means and other constants, then the whole may be said to be homogeneous so far as this test goes.

Example (3). Another example may be given from the memoir on Variation and Correlation in Brain Weight, by Raymond Pearl [Biometrika, vol. iv., pp. 13 et seq.]. The author wished particularly to investigate the change of brain weight with age; on the hypothesis that the weight of the brain reaches a maximum between the ages of 15 and 20, remains unchanged from 20 to 50, and then begins to decline and so continues till death, the material was divided into a 'Young' series, ages 20 to 50, and a 'Total' series including all between 20 and 80. The 'Young' series thus formed a selection from the 'Total' series, but a selection based on age and not on brain weight. If there were no correlation between age and brain weight, this selection, based as it is on age, would, of course, be random as regards brain weight. Now correlation does exist between the two, but it is so slight that, within the limits of error, the 'Young' series does form practically a random sample of the 'Total' series, as is shown by the following figures:

Difference in Variation Constants between Young and Total Series (written with a positive sign when the Young Series gives the greater value).

| | Male: S.D. | Male: C. of V. | Female: S.D. | Female: C. of V. |
|---|---|---|---|---|
| Swedes | +2.851 ± 4.066 | +0.122 ± 0.291 | +4.786 ± 5.460 | +0.271 ± 0.435 |
| Bavarians | −1.888 ± 3.556 | −0.173 ± 0.234 | −10.357 ± 3.909 | −0.941 ± 0.320 |

Thus in only one case, that of the Bavarian females, is the difference between the variabilities, S.D. or C. of V., of the two series as great as its probable error, and even in that case the differences, 10.357 and 0.941, are not three times as large as their respective p.e.'s, 3.909 and 0.320. Dr.
Pearl concludes from these and similar results that 'the series are reasonably homogeneous in other respects than age.'

The reader is recommended to test his knowledge of the formulae for probable errors by applying them to the following examples. Dr. Alice Lee, in a note on Dr. Ludwig on Variation and Correlation in Plants [Biometrika, vol. i., p. 316] makes use of the statistics relating to Ficaria Verna in Example (4). Those in Example (5) are taken from among a large number of others in the highly interesting memoir, On the Laws of Inheritance in Man, by Professor Karl Pearson and Dr. Alice Lee [Biometrika, vol. ii., pp. 357 et seq.] cited once before.

Example (4). Variation and Correlation in Ficaria Verna.

| No. of Observations. | Mean No. of Petals; S.D. | Mean No. of Sepals; S.D. | Correlation between No. of Sepals and No. of Petals. |
|---|---|---|---|
| 1000 (Greiz A) | 8.286; 1.3382 | 3.695; 0.8524 | 0.2439 ± 0.0201 |
| 1000 (Greiz G) | 8.232; 0.9954 | 3.437; 0.7033 | 0.2480 ± 0.0200 |

We have here all the data necessary to find the p.e.'s of the means, variabilities, and correlations, and we wish to know whether the differences between the means and variabilities of the A and G plants can be accounted for by random sampling alone. For example, the difference between the petal means

$$= (8.286 - 8.232) \pm \tfrac{2}{3}\sqrt{\frac{(1.3382)^2}{1000} + \frac{(0.9954)^2}{1000}} = 0.054 \pm 0.035.$$

Clearly this difference, being not so great as twice its p.e., is not significant and may quite well be due to random sampling.

Again, the difference between the petal variabilities

$$= (1.3382 - 0.9954) \pm \tfrac{2}{3}\sqrt{\frac{(1.3382)^2}{2000} + \frac{(0.9954)^2}{2000}} = 0.3428 \pm 0.025,$$

which is certainly much too great to be explained away by random sampling merely.

Similarly the differences between the sepal means, between the sepal variabilities, and between the correlations, may be tested for significance by comparison with their p.e.'s.

Example (5). Size and Variability of Stature in the Two Generations.

| | Father. | Mother. | Son. | Daughter. |
|---|---|---|---|---|
| Mean height (in.) | 67.68 ± 0.06 | 62.48 ± 0.05 | 68.65 ± 0.05 | 63.87 ± 0.05 |
| S.D. (in.) | 2.70 ± 0.04 | 2.39 ± 0.04 | 2.71 ± 0.04 | 2.61 ± 0.03 |
| C. of V. (per cent.) | 3.99 ± 0.06 | 3.83 ± 0.06 | 3.95 ± 0.06 | 4.09 ± 0.05 |

The student in this case might use one of the formulae for the p.e.'s to find the number of fathers, mothers, sons, or daughters observed when the p.e.'s are known, and then the remaining p.e.'s might be verified when the numbers of observations are found.

As evidence of 'assortative mating,' the tendency of like to mate with like, the following particulars are given, based on 1000 to 1050 cases of husband and wife:

| Correlation between | |
|---|---|
| stature of husband and stature of wife | 0.2804 ± 0.0189 |
| span of husband and span of wife | 0.1989 ± 0.0204 |
| forearm of husband and forearm of wife | 0.1977 ± 0.0205 |

To measure the average intensity of inheritance, the extent of resemblance between parents and children in any character, coefficients of correlation are calculated such as the following:

| Coefficient of Correlation between | |
|---|---|
| stature of father and stature of son | 0.514 ± 0.015 |
| stature of father and stature of daughter | 0.510 ± 0.016 |
| stature of mother and stature of son | 0.494 ± 0.016 |
| stature of mother and stature of daughter | 0.507 ± 0.016 |

[In verifying the p.e.'s for this case take the number of observations to be 1024.]

One more extract may be quoted, a prediction table, giving the probable mean stature of sons of fathers of given stature, and so on:

Son's probable stature = 33.73 + 0.516 (father's stature) ± 1.56
Daughter's probable stature = 30.50 + 0.493 (father's stature) ± 1.51
Son's probable stature = 33.65 + 0.560 (mother's stature) ± 1.69
Daughter's probable stature = 29.28 + 0.554 (mother's stature) ± 1.52.

All values given in this example for the p.e.'s should be verified.

Before we consider further applications of these principles to questions of a somewhat different kind, let us imagine a very simple though artificial illustration. Suppose we have 999 sheep, each one ticketed, the numbers on the tickets running from 1 to 999.
Also suppose 666 of these sheep are white and 333 are black, so that, if we pick out any one at random, the chance of it being black is 333/999 or 1/3. Let us call picking a black sheep a 'success,' then $p = 1/3$, $q = 2/3$.

We proceed now to select 99 sheep in succession at random from the flock with the understanding that each sheep is returned into the flock before the next is picked out. This insures that the chance of a success at each selection remains equal to 1/3 and, of course, there is nothing to prevent the same sheep being picked more than once. The selection might practically be made by placing in a box 999 tickets, numbered from 1 to 999, one to correspond to each sheep, then picking out 99 of them in succession, being careful to replace each and to shake up the box before picking out the next; if there were absolutely no difference between the tickets, such as would cause one to be picked more easily than another, the selection made in this way would be random in the sense required, and the tickets so chosen would determine which sheep were to be taken and which left.

The proportion of black sheep to be expected in such a random selection of 99 is 1/3, but, if we only perform the experiment once, it is quite likely that the proportion we actually get will differ from 1/3 by an amount

$$= 0.6745\sqrt{pq/n} = 0.6745\sqrt{\tfrac{1}{3} \cdot \tfrac{2}{3} \cdot \tfrac{1}{99}} = 1/31, \text{ about,}$$

while it is unlikely that the proportion will differ from 1/3 by much more than 3/31, or 1/10.

Conversely, and it is really the converse which is useful in practice, if we do not know the proportion of black sheep in the whole flock, we may get a fair estimate of it by taking a random sample of 99 sheep (any other number will serve the purpose, but the larger the better for accuracy), and if we find that in this sample there are 33 black sheep, i.e.
$p = 33/99 = 1/3$, it will appear that the value of $p$ for the whole flock is 1/3, subject to a probable error $0.6745\sqrt{pq/n}$ in excess or defect, i.e. the true proportion for the whole flock may quite likely differ from 1/3 by as much as 1/31, but it is unlikely to differ by much more than 1/10. It should be noticed that the calculation of the probable error in this converse case is based upon the value of $p$ given by the sample taken, for that is the only value of which we have knowledge.

Too much stress can scarcely be laid on the fact that the samples chosen must be absolutely unbiassed, otherwise the use of the formulae $np$ and $\sqrt{npq}$, or the corresponding proportional formulae, cannot be justified: each sheep in our illustration must have the same chance of being picked, and no one selection is to have any influence on another. The failure to appreciate this essential point has led to no little waste of time and effort in the collection of valueless statistics.

The method of sampling has been employed in a way at once interesting and useful by Dr. A. L. Bowley, and, as some of this work has barely received the attention it deserves, it may be well to explain two of his experiments in some detail. The first was of interest because its results could be tested by an examination of the original record from which the sample was taken. The details concerning it are abstracted from the Journal of the Royal Statistical Society, September 1906.

Example (6). Bowley sampled the dividends paid by 3878 companies as quoted in the Investors' Record. His sample consisted of 400 of these companies, i.e. about 10 per cent., selected in a purely arbitrary fashion thus: the investigator took a Nautical Almanac and noted down the last digits of one of the tables, recording them in groups of four, but if any particular group gave a number bigger than 3878 he rejected it.
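The rule of selection just described may be imitated in a few lines of Python. This is a sketch only, resting on assumptions not in the original: Python's pseudo-random generator stands in for the last digits of the Nautical Almanac table, the group 0000 is rejected along with every number above 3878, and a number drawn twice is simply kept twice, since the account does not say how repetitions were treated. The function name and its parameters are hypothetical.

```python
import random


def draw_company_numbers(n_companies: int = 3878,
                         sample_size: int = 400,
                         seed=None) -> list:
    """Draw company numbers by forming groups of four random digits and
    rejecting any group that does not name a company (assumed rule)."""
    rng = random.Random(seed)  # stand-in for the almanac's last digits
    chosen = []
    while len(chosen) < sample_size:
        digits = [rng.randrange(10) for _ in range(4)]   # e.g. 0, 3, 2, 7
        number = int("".join(str(d) for d in digits))    # 0327 -> 327
        if 1 <= number <= n_companies:                   # reject 0000 and > 3878
            chosen.append(number)
    return chosen


if __name__ == "__main__":
    sample = draw_company_numbers(seed=1906)
    print(len(sample), "numbers drawn; smallest", min(sample),
          "largest", max(sample))
```

Because every group of four digits is equally likely and rejected groups are merely discarded, each admissible number retains the same chance of being chosen at every draw.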
In this way each of the numbers between 1 and 3878 had an equal chance of selection (for numbers under four figures would appear like 0327, 0042, 0009, which would be taken to represent 327, 42, 9 respectively), and the selection of one had no influence on that of any other. The companies in the Investors' Record were numbered consecutively, and the dividends corresponding to the 400 arbitrary numbers obtained formed the sample with which Bowley worked.

After making some interesting deductions with regard to the average for the whole distribution, to which we shall return presently, he proceeded to forecast the grouping of the original companies as to their dividends by setting out the grouping discovered in the sample 400, as follows, using the standard deviation in place of the probable error as the error due to random sampling:

Table (39). Distribution of Dividends paid by a Sample of 400 Companies.

(1) Dividend | (2) Sample of 400 Companies | (3) Percentage of Sample Companies in each Class | (4) Percentage of all Companies in each Class
Nil | 28 | 7 with S.D. = 1.27 | 6
£1 to £2, 19s. 9d. | 6 | 1½ | 1.5
£3 to £3, 9s. 9d. | 37 | 9¼ with S.D. = 1.46 | 8.4
£3, 10s. to £3, 19s. 9d. | 71 | 17¾ with S.D. = 1.90 | 18.8
£4 to £4, 9s. 9d. | 64 | 16 with S.D. = 1.83 | 17.3
£4, 10s. to £4, 19s. 9d. | 53 | 13¼ with S.D. = 1.68 | 13.8
£5 to £5, 19s. 9d. | 60 | 15 with S.D. = 1.78 | 17.7
£6 to £7, 19s. 9d. | 48 | 12 with S.D. = 1.63 | 10.8
£8 to £10, 19s. 9d. | 29 | 7¼ with S.D. = 1.29 | 3.8
Above £11 | 4 | 1 | 1.9

In col. (3) the S.D. for each group was calculated as follows. For the first group: out of 400 possible events we have 28 successful events, meaning by 'successful' here 'a company paying no dividend,' thus p = 28/400, q = 372/400. Hence the S.D. of the frequency in the first group
$$= \sqrt{28\left(1-\tfrac{28}{400}\right)} = \sqrt{28\times372}\,/\,20 = 5.1.$$
Since this is for a sample of 400, the S.D. of the percentage* frequency in the first group $= \tfrac14(5.1) = 1.27$.
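The S.D. column of Table (39) can be re-derived mechanically. The following is a minimal sketch, not Bowley's own working: the function name `percentage_sd` and the list `SAMPLE_COUNTS` are introduced here purely for illustration, and the counts are those of col. (2). Small differences in the last figure from the printed values are to be expected from rounding.

```python
import math

# Sample counts from col. (2) of Table (39), in class order (Nil ... Above £11).
SAMPLE_COUNTS = [28, 6, 37, 71, 64, 53, 60, 48, 29, 4]

def percentage_sd(count, n):
    """S.D. of the percentage frequency of a class holding `count` of n items."""
    p = count / n
    sd_frequency = math.sqrt(n * p * (1 - p))  # sqrt(npq): S.D. of the class count
    return 100 * sd_frequency / n              # the same S.D. expressed per cent.

n = sum(SAMPLE_COUNTS)  # 400 companies in the sample
for count in SAMPLE_COUNTS:
    pct = 100 * count / n
    print(f"count {count:3d}  percentage {pct:5.2f}  S.D. {percentage_sd(count, n):.2f}")
```

The division by n (here 400) rather than by 100 is the point of the footnote: the precision belongs to the sample of 400, not to an imagined sample of 100.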
The other S.D.'s are calculated in the same way, but when the number in a class is very small the forecast can scarcely be relied upon and consequently the S.D. is not inserted.

It will be noted, by comparing with the numbers in col. (4), showing the corresponding percentages for all the 3878 companies, that every forecast was remarkably good except one, class £8 to £10, 19s. 9d., where the error approaches three times the S.D., and the exception will serve as a warning that, in working with samples, the unexpected sometimes happens.

Professor Edgeworth, in his Presidential Address to the Royal Statistical Society (1912), points out that the method appears to be a permanent institution in the Statistical Bureau at Christiania, where it has given very good results. These can be checked or 'controlled' for safety if complete statistics are obtainable under some heads. He fairly sums up the utility of sampling when he says that 'we may obtain from samples a general outline of the facts, often sufficient for the initiation of a project like that of insurance, rather than the features in detail.'

Bowley also divided up his 400 random samples into 40 groups of 10 companies each, and calculated the average for each group. The S.D. for these 40 averages was found in the usual way, giving 0.775. But since this was the S.D. for averages of 10, we conclude that
$$(\text{the S.D. for the distribution of the 400 companies})/\sqrt{10} = 0.775,$$
i.e. the S.D. for the distribution of the 400 companies $= 0.775\sqrt{10}$. Hence, applying the same principle again, the S.D. of the average of the 400 sample companies
$$= 0.775\sqrt{10}/\sqrt{400} = £0.122.$$

[* It would not be correct to take $\sqrt{7(1-\tfrac{7}{100})}$ as the S.D. of the percentage frequency in the first group; this value would be double the true value, namely, $\tfrac14\sqrt{28(1-\tfrac{28}{400})} = \tfrac12\sqrt{7(1-\tfrac{7}{100})}$, because the accuracy is increased by increasing the number of events in a sample, and the sample here is really 400 and not 100.]
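The two-step scaling just described (from averages of 10 up to single companies, then down to the average of all 400) may be sketched as follows. The function name is hypothetical, and the only data used are the figures 0.775, 10 and 400 quoted above.

```python
import math

def sd_of_sample_mean(sd_group_means, group_size, sample_size):
    """Estimate the S.D. of the full-sample average from the S.D. of group averages."""
    # S.D. of a single item = S.D. of averages of `group_size` items * sqrt(group_size)
    sd_single = sd_group_means * math.sqrt(group_size)
    # S.D. of the average of `sample_size` items = S.D. of a single item / sqrt(sample_size)
    return sd_single / math.sqrt(sample_size)

sd_mean = sd_of_sample_mean(0.775, group_size=10, sample_size=400)
print(f"S.D. of the average of the 400 companies: about £{sd_mean:.4f}")
```

The result agrees with the £0.122 of the text to the accuracy with which that figure is quoted.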
Now the average of the 400 samples turned out to be £4.7435. Hence it was judged that, if this was a fair selection (and the random method adopted was such as to make it fair in all reasonable likelihood), the average for the 3878 companies should certainly lie between £[4.7435 ± 3(0.122)]. The true average was found by actual calculation to be £4.779, well within the above limits, although the original items varied from nil to £103, being grouped according to the nature of the security (Government, Railways, Mines, etc.), and the averages and S.D.'s on successive pages differed materially. This aggregation, Bowley remarks, is very similar to that found in wages in different occupations and localities, and in many other practical examples.

The value of the second experiment due to Dr. Bowley lies in the suggestion that similar means can be applied with good results to the investigation of many social phenomena. If out of a large group a comparatively small sample of statistics is collected in the purely random manner already described, we are able by such means to estimate what is the average, and even to obtain limits between which the average will almost certainly lie, in the large group, based upon values found for the average and S.D. in the small sample.

Example (7). With the collaboration of Mr. Burnett-Hurst and a number of other workers, Dr. Bowley conducted an inquiry into the conditions of working-class households in four representative towns (Northampton, Warrington, Stanley, and Reading), the results of which are published by Messrs. Bell and Sons under the title of Livelihood and Poverty. They are similar in character to those obtained by Rowntree in his study of conditions in York, but what is peculiar to Bowley's inquiry is that only a sample, about 1 in 20, of the working-class houses in each town was examined, and the conditions in the towns as a whole were deduced from these samples.
We are not concerned here with the actual facts disclosed by the investigation, striking as they are, but with the explanation of the sampling method adopted, and as to that it may be remarked that the foundation on which it rests is precisely the same as that which underlay the example of the 999 black and white sheep. The main point to notice here again is that Bowley was careful to select his samples in unbiassed fashion as follows: 'For each town a list of all houses . . . was obtained, and without reference to anything except the accidental order (alphabetical by streets or otherwise) in the list, one entry in twenty was ticked. The buildings so marked, other than shops, institutions, factories, etc., formed the sample.'

It will be evident that this method of choice is not quite on the same level of randomness as that followed, for example, in drawing cards from a well-shuffled pack, each card to be replaced and the pack reshuffled before the next is drawn; but, for that very reason, the results of the experiment are all the more likely to be well within the limits of error provided by the formulae of the ideal case. The deliberate selection of every twentieth house in each street is likely, that is to say, to give a more representative picture of the town as a whole than would be obtained by selecting the same number of houses in a purely random fashion, which might by chance give too much emphasis to some street or district. A practical test of the goodness of the sample was possible by comparing the results in a few instances with information available from other sources.
In order to make the method of working quite clear, let the guiding principle first be recalled: 'If, in a random sample of n items, the proportion of successes is p, then the proportion of successes in the universe from which the sample is selected will not be likely to fall outside the limits $p \pm 3(0.6745)\sqrt{pq/n}$, and, if that universe contains altogether N items, the number of successes will not be likely to fall outside the limits $Np \pm 3(0.6745)N\sqrt{pq/n}$.'

In Reading the total number of all inhabited houses in the borough was 18,000 at the time of the inquiry, i.e. N = 18,000. The total number of houses visited was 840, i.e. n = 840. If we call a house assessed at £8 or less a 'success,' the number of such houses found in the sample was 206. Thus p = 206/840, q = 634/840, and the number of houses rented at £8 or less in the whole borough should be Np with a p.e. $= 0.6745\,N\sqrt{pq/n}$, i.e. 4414 ± 180. The actual number of houses so rented was known from other sources to be 4380, well within the limits forecasted.

The value used for p in the above is that given by the sample, but when we know the actual number of successes in the universe as a whole, as in this case we do, we might use the true value of p, i.e. the value for the universe in place of that for the sample.

The argument might also be put in another way without affecting the principle employed, thus: The number of houses rented at £8 or less in the whole borough was 4380. But the proportion of houses sampled in the whole borough was 840/18000, i.e. 1/21.43. Hence the number of houses at the above rental to be expected in the sample = 4380/21.43 = 204. The number actually found in the sample was 206, with a probable error
$$= 0.6745\sqrt{840\times\tfrac{4380}{18000}\times\tfrac{13620}{18000}} = 8, \text{ approximately}.$$

Again, the number of persons engaged in a certain occupation at Reading was known to be 761 in the borough as a whole.
Hence the number of persons so engaged to be expected in the sample was 761/21.43, i.e. 35. The number actually found in the sample was 29, with a probable error
$$= 0.6745\sqrt{npq} = 0.6745\sqrt{840\times\tfrac{761}{18000}\times\tfrac{17239}{18000}} = 4, \text{ approximately}.$$

Further examples of the method are here given, in each of which the total number of events is small so that the number in each sample is also small, and since, as we have seen, the accuracy or precision of the proportion of successes discovered in any sample varies directly as the square root of the number of events the sample contains, the results cannot be expected to be so good when this number is small.

Example (8). 514 candidates sat a certain examination paper; their marks ranged from 3 to 64. The candidates were numbered consecutively from 1 to 514, and a random sample of 90 (17½ per cent.) was selected from among them by writing down the 90 numbers formed by the digits in the seventh decimal place, taken in groups of three, in the logs of the numbers 10104, 10204, 10304, . . . , as given in Chambers's Tables, neglecting all numbers greater than 514 and calling such numbers as 005, 037, etc., 5, 37, etc. In this way each of the numbers between 1 and 514 stood an equal chance of inclusion.

The distribution of candidates in the sample is compared with that for all 514 together in the following table:

No. of Marks Obtained | Percentage of All Candidates who obtained these Marks | Percentage of Candidates in Sample who obtained these Marks, with p.e.
Less than 15 | 8 | 8 ± 1.9
15 but less than 25 | 19 | 17 ± 2.6
25 but less than 30 | 16 | 18 ± 2.7
30 but less than 35 | 18 | 13 ± 2.4
35 but less than 40 | 15 | 17 ± 2.6
40 but less than 50 | 19 | 18 ± 2.7
50 and over | 7 | 10 ± 2.1

The reader might verify the p.e.'s given in the last column: e.g. proportion in the sample obtaining less than 15 marks = 7/90; therefore p = 7/90, q = 83/90. Hence the S.D. for this group $= \sqrt{7(1-\tfrac{7}{90})} = 2.54$, and the S.D. for the percentage $= \tfrac{100}{90}\times2.54 = 2.8$. Thus the p.e. for the percentage $= 0.6745\times2.8 = 1.9$, approximately.
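The probable errors quoted for the Reading houses and for Example (8) all come from the one formula, 0.6745 times the S.D. of a proportion. A short sketch of that arithmetic follows; the helper name `probable_error_of_proportion` is an assumption made for illustration, and the inputs are the figures given in the text.

```python
import math

PE_FACTOR = 0.6745  # probable error = 0.6745 x standard deviation

def probable_error_of_proportion(successes, n):
    """Probable error of the proportion of successes found in a random sample of n."""
    p = successes / n
    return PE_FACTOR * math.sqrt(p * (1 - p) / n)

# Reading: 206 of the 840 sampled houses rented at £8 or less; 18,000 houses in all.
N, n, found = 18000, 840, 206
estimate = N * found / n                               # forecast for the borough
pe_count = N * probable_error_of_proportion(found, n)  # its probable error
print(f"houses at £8 or less: {estimate:.0f} with p.e. {pe_count:.0f}")

# Example (8): 7 of the 90 sampled candidates obtained less than 15 marks.
pe_percent = 100 * probable_error_of_proportion(7, 90)
print(f"p.e. of the percentage under 15 marks: {pe_percent:.1f}")
```

Multiplying the proportional error by N (or by 100) converts it to an error in a count (or a percentage), which is all that distinguishes the several cases worked above.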
Example (9) deals in a similar way with the data concerning infectious diseases in 241 towns in England and Wales previously recorded on p. 62. A sample of 60 towns, i.e. about 25 per cent., was chosen in a random fashion as in the last example, and the sample distribution is compared below with that of the 241 towns as a whole. The verification of the probable errors in this and the next case is left to the reader.

Case Rate per 1000 of the Population | Actual No. of Towns so rated | No. as suggested by the Sample, with p.e.
1 and under 5 | 85 | 92 ± 10
5 and under 9 | 86 | 96 ± 10
9 and under 13 | 42 | 28 ± 7
13 and over | 28 | 24 ± 6

Example (10) is concerned with the annual output per head in 142 different types of employment as given in 1907 by the Census of Production [data from Sixteenth Abstract of Labour Statistics of the United Kingdom, Cd. 7131]. The distribution suggested by a random sample of 50 different occupations is compared with that of the complete list of 142 occupations.

Output per head | No. of Occupations in Sample with this Output | No. in Complete List as deduced from Sample, with p.e. | Actual No. found in Complete List
Under £60 | 4 | 11 ± 3.6 | 12
£60 and under £80 | 16 | 45 ± 6.2 | 42
£80 and under £100 | 6 | 17 ± 4.3 | 25
£100 and under £120 | 10 | 28 ± 5.3 | 20
£120 and under £190 | 8 | 23 ± 4.9 | 27
£190 and over | 6 | 17 ± 4.3 | 16

The S.D. in each of the last three examples has been calculated by using the value for p given by the sample, which is the value one must fall back upon in practice when the true p for the whole distribution is unknown. In any case where we are able to test our sample by comparison with the whole distribution, however, it is possible to use the true value of p, e.g. in Example (10), output £100-120, p = 20/142 as opposed to 10/50.

CHAPTER XV
CURVE FITTING: PEARSON'S GENERALIZED PROBABILITY CURVE

It may be recalled that in the introductory chapter an outline was given of the manner in which the theory of Statistics might be conceived to develop.
It was shown how the desire for simplification and the need for compression leads to the division of a large mass of figures dealing with any given matter into groups; indeed, it may well be that the statistics have been so arranged at the source in the act of collecting: e.g. we may have to deal with so many males of height 54 in. and less than 55 in., so many of height 55 in. and less than 56 in., so many of height 56 in. and less than 57 in., and so on. Here, corresponding to each given height, which we may label x, or each range of height, such as $x_1$ to $x_2$, we have a certain frequency of males of that height or range, which frequency we may label y, and hence a frequency table can be formed showing the variation of y with x. Further we have seen how such pairs of corresponding values of x and y can be plotted so as to picture the complete observed frequency distribution to the eye.

Now the representation thus made, though helpful up to a point, is not entirely satisfactory. Whether we simply join up successive points (x, y), or set up rectangles of varying height y on bases spanning the successive ranges of x, or erect ordinates (y's) at the mid-points of these bases, joining the summits in the manner previously described, the connection so established between each observation and the next is too superficial, depending merely on the fact of casual neighbourship, and may sometimes give a false impression of frequency and changes in frequency in the population of which the observations are but a sample. And this is necessarily so if we confine ourselves strictly to the data observed.

One difficulty which has to be faced is that only within certain broad limits can we trust our observations to give us information which is truly representative of the population in which we are interested.
We seldom if ever deal with the whole population: in fact it may be so large that it is impracticable even to reckon it; instead we make a random or unbiassed selection of a smaller but adequate number of individuals belonging to the population, and classify them according to the size or nature of the character which concerns us. But, granted that our sample is adequate in size and unbiassed, the numbers obtained in the different groups of the frequency distribution will still be subject to the errors of random sampling, and it is only after these errors have been calculated that we can lay down the probable limits within which our sample may be regarded as really representative of the population as a whole.

Another difficulty arises owing to the fact that our observations in general do not cover the whole field of values of the variables x and y; we may quite likely want to know the percentage frequency, y, of individuals with a character (height or whatever it may be) x which does not chance to be any one of the x's observed, if the observations are only recorded according to discrete (separately distinct, like 5 ft., 6 ft., 7 ft.) values of x; on the other hand, if the observations have been classed in groups, the frequency in which we are interested may refer to an x which does not coincide with the centre of any group or which is even outside the range altogether. We have therefore further to inquire whether such information can be deduced in any way from the statistics collected.

Now it so happens that both these difficulties disappear if we can only attain the ideal already outlined in discussing graphs, and find a suitable curve to 'fit' the statistics observed.
Such a curve would not necessarily pass through all or any of the points (x, y) representing the observations, for these, as we have remarked, are subject to errors of random sampling and the observed frequency y of any x may be greater or less than the corresponding y in the population at large to which the curve is presumed to approximate. The curve in short must remove the roughnesses which are inseparable from ordinary observation. Moreover, given any x, not merely one of the x's observed, it must be possible to read off from it the corresponding y, the frequency appropriate to that x.

It is not always accurate enough for our purpose to draw a curve by eye, passing as evenly as possible through the middle of the points observed in the manner conceived in an earlier chapter. It is necessary in some way to find an algebraical formula, possibly even a trigonometrical, exponential, or more complex expression, which will give the y corresponding to any x desired. This formula or equation must depend upon the statistics collected: i.e. the constants involved in it must be directly and fairly easily computed from the y's observed, and the results of all the observations should enter into the equations which determine the constants in order to make use of the full information at our disposal. In addition, the method of determining the equation and its constants should be as general as possible, so relieving us of the trouble of discovering a new method owing to the failure of the original one at nearly every trial. Finally, the equation should not be so intricate as to make the labour of calculating y for any given x too heavy to be attempted with the ordinary equipment at the statistician's disposal.
Once such an equation is found it is a fairly straightforward proceeding to trace the curve for which it stands, and it will remain afterwards to test the goodness of fit in some more refined way than by seeing how closely it passes through the observed points by eye.

When we come to review the shapes of the frequency polygons or histograms most commonly met, we find that the majority of them start from low frequency, rise to a maximum as x, the character observed, increases, then fall again towards zero, very likely at a different rate. In fact the statistics suggest a shape something like that shown in fig. (27) for the corresponding frequency curve, though we cannot be sure that it would coincide with the axis at either extremity. [Cases do occur where the curve has two or even more humps (maxima), but we purposely restrict ourselves to the simpler and more frequent type described.]

[Fig. (27)]

Now the simplest shape to deal with from the algebraical point of view would certainly be symmetrical in character, corresponding to statistics which rise and fall at the same rate, though this would not necessarily be the most common shape among the records of actual life. In order to simplify our problem, therefore, we might start by making up for ourselves an ideally simple set of statistics which are perfectly symmetrical, and see whether we can discover a process for fitting a curve in a case of that kind. If this prove successful it might be possible afterwards to adapt the same process to an unsymmetrical or 'skew' set of statistics made up in a similar way. Then finally we should inquire whether actual observations conform to any of the types of curve discovered, and, if so, how they can be fitted together.

Now in manufacturing our statistics we must keep before us the object at which we are aiming. Given the statistics, what we want is a formula, algebraical or of some other kind, to fit them.
This raises the possibility of choosing the statistics themselves in some algebraical form, and such a form is at hand in the binomial expansion, which is, in fact, one of the first examples of a general symmetrical expression one meets. Thus
$$(a+b)^2 = a^2 + 2ab + b^2$$
$$(a+b)^3 = a^3 + 3a^2b + 3ab^2 + b^3$$
$$(a+b)^4 = a^4 + 4a^3b + 6a^2b^2 + 4ab^3 + b^4$$
$$(a+b)^5 = a^5 + 5a^4b + 10a^3b^2 + 10a^2b^3 + 5ab^4 + b^5$$
$$(a+b)^n = a^n + na^{n-1}b + \frac{n(n-1)}{1\cdot2}a^{n-2}b^2 + \ldots + \frac{n(n-1)}{1\cdot2}a^2b^{n-2} + nab^{n-1} + b^n.$$
Clearly all these expressions become perfectly symmetrical if we put a = b, for they read the same whether we run from left to right or from right to left.

We have already seen what an important part the binomial expansion plays in the early stages of the theory of probability: e.g. $(\tfrac12+\tfrac12)^{10}$, when expanded, tells us at once the proportion of times on the average we may expect 10 heads, 9 heads and 1 tail, 8 heads and 2 tails, and so on, when we toss an evenly-balanced coin ten times in succession; or again, if p is the probability that a certain event will happen, and q the probability that it will fail to happen at one trial, then the probabilities that it will happen n times, (n - 1) times, (n - 2) times, . . . in n trials are given by the successive terms in the expansion of $(p+q)^n$.

However, we make no assumption for the moment as to the values of a and b, except that in the symmetrical case with which we begin they are equal, and we have as the successive terms of $(a+a)^n$:
$$a^n,\quad na^n,\quad \frac{n(n-1)}{1\cdot2}a^n,\quad \ldots$$
Let us suppose that our observed statistics take the above form so that these terms may be plotted as a succession of ordinates, $y_1, y_2, y_3, \ldots, y_{n+1}$, associated with abscissae, $x_1, x_2, x_3, \ldots, x_{n+1}$, at equal distances apart measured, say, by c; for convenience we may place the origin as in fig. (28), so that
$$x_1 = c,\quad x_2 = 2c,\quad x_3 = 3c,\quad \ldots,\quad x_{n+1} = (n+1)c,$$
and we can then form a frequency polygon, where
$$x_r = rc,\qquad y_r = \frac{n(n-1)(n-2)\cdots(n-r+2)}{1\cdot2\cdot3\cdots(r-1)}\,a^n$$
are typical values of a pair of the variables x and y, each such pair defining a vertex of the polygon.

[Fig. (28)]

Now in this case, since the statistics have been artificially built up by ourselves and are not in reality a random selection, they are not subject to errors of sampling and the fitting curve should, therefore, pass through the summits of all the y's, or, perhaps better, touch each of the lines joining adjacent summits. The curve only differs from the neighbouring outline of the polygon in that the latter is discontinuous, it alters its direction relative to the axis of x by jerks at equal intervals c measured along OX, whereas the former must rise gradually and continuously and then fall in the same way. This is one sense in which we mean that the fitting curve removes the roughness of the observation statistics: it gets rid of jerks besides filling gaps in the observations.

It will be clear that as n increases and c diminishes (and this is what we aim at in collecting statistics, though it has not been assumed in what immediately follows) the discontinuity in the polygon becomes less and less pronounced and the outline of the figure approximates more and more closely to the curve. Moreover this approximation gains in intensity if we make the slope of the curve at each appropriate point the same as the slope obtained by joining up the summits of adjacent ordinates of the polygon.

Now the expression $(y_{r+1}-y_r)/c$ is the measure of the gradient from the rth ordinate to the (r+1)th, and
$$\frac{y_{r+1}-y_r}{c} = \frac{a^n}{c}\left[\frac{n(n-1)\cdots(n-r+1)}{1\cdot2\cdots r} - \frac{n(n-1)\cdots(n-r+2)}{1\cdot2\cdots(r-1)}\right] = y_r\,\frac{n-2r+1}{rc}.$$
If this be also taken as the gradient of the tangent to the curve at the point midway between $(x_r, y_r)$ and $(x_{r+1}, y_{r+1})$, calling this point (x, y) we have, since, in the notation of the differential calculus,
$dy/dx$ is the measure of the gradient of the curve at this point,
$$\frac{dy}{dx} = \frac{y_{r+1}-y_r}{c} = y_r\,\frac{n-2r+1}{rc}.$$
And
$$x = \tfrac12(x_r + x_{r+1}) = \tfrac12[rc + (r+1)c] = \frac{c}{2}(2r+1),$$
$$y = \tfrac12(y_r + y_{r+1}) = \tfrac12\,y_r\left[1 + \frac{n-r+1}{r}\right] = y_r\,\frac{n+1}{2r}.$$
Hence
$$y_r = \frac{2ry}{n+1}.$$
Thus
$$\frac{dy}{dx} = \frac{2ry}{n+1}\cdot\frac{n-2r+1}{rc} = \frac{2y}{(n+1)c}\,[(n+2)-(2r+1)] = \frac{2y}{(n+1)c}\left(n+2-\frac{2x}{c}\right).$$

But if we had started with any other two adjacent ordinates instead of $y_r$ and $y_{r+1}$ we should have been led to exactly the same relation connecting the corresponding x and y of the required curve, for r, which serves to particularize the ordinates, does not appear in the relation at all: their individuality has been eliminated. The above equation may thus, if we please, be taken as holding good for, and therefore defining, all points (x, y) of the fitting curve: it is, in short, the differential equation of that curve.

The equation may be slightly simplified by transferring the origin to the point $\left(\tfrac12(n+2)c,\ 0\right)$, evidently the point O' in fig. (28) corresponding to the maximum ordinate of the polygon or curve. Algebraically, this merely means that for x we must write $x + \tfrac12(n+2)c$ in the equation, which then becomes
$$\frac{dy}{dx} = \frac{2y}{(n+1)c}\left(-\frac{2x}{c}\right) = -\frac{4xy}{(n+1)c^2}.$$
We may pass to the equation proper of the curve by integration. Thus, separating the variables,
$$\int\frac{dy}{y} + \frac{4}{(n+1)c^2}\int x\,dx = 0.$$
Therefore,
$$\log y + \frac{2x^2}{(n+1)c^2} + A = 0,$$
where A is a constant. Hence
$$y = y_0\,e^{-2x^2/(n+1)c^2},$$
where $y_0$ is a new constant. This may be written
$$y = y_0\,e^{-x^2/2\sigma^2}, \qquad (1)$$
where $\sigma^2 = (n+1)c^2/4$, and it is called the probability curve or normal curve of error.*

Let us now see whether the procedure so far followed is applicable in the case of an unsymmetrical or skew distribution of statistics. With this object we will suppose the frequencies of observations in successive groups to be represented by the corresponding terms in the expansion of $(p+q)^n$, and as before we can form a frequency polygon by joining the summits of the ordinates
$$y_1 = p^n,\quad y_2 = np^{n-1}q,\quad y_3 = \frac{n(n-1)}{1\cdot2}p^{n-2}q^2,\quad \ldots,\quad y_{n+1} = q^n,$$
yn+i='t, [* Karl Pearson's method of getting the normal curve equation has been adopted as tlie basis of the above discussion, in preference to that usually followed, which develops the curve also from the binomial expression but some- what on tlie lines of Laplace and INjisson. They showed that the sum of all the terms lying within a range t on either side of the maximum term in the expan- sion of (p + q)" is approximately 1 ^/2w■<7 ■ [+t J-t where (r= s,l{n2)q), whence the equation of tlie curve is derived. (See Historical Note at the end of Chapter xviii. )] CURVE FITTING 185 erected on the axis of x at distances from the origin given by the figure being very similar to that in the symmetrical case. The gradient of the fitting curve where it touches the join of (^r> Vr) to (^,+1, Vr^y) is given by dx c and we must try and express the right-hand side as before in terms of {x, y), the co-ordinates of the mid-point of the line joining (^r. y,) to (XV+I, yr+x). We have dy_\ dx c 'n{n-\). . . in-r+l)^n-Y_ ^(^-1) • • • (^-^+2) n_,.+Y. 1-2 1-2 .. . (r-1) ■1 ^n_Y-i n{n-l) (w— r+2) 1-2 (r-1) n—r-\-l q~p Also 2x=x^.-\-Xj.^i=rc-\-{r-\- l)c=^{2r-\-l)c ?i('/i— 1) . . . (/I— r-f2) 2y=yr+yr+i = -^r^^^ , — 7^ — • p 1-2 Thus dy_2y/n—r-{-l dx c\ r q-p )/( (r-1) w— r+1 n-Y-i 7i—r-\-l , q^p r g-^p =^[{n+l)q-r{p+q)]![{n+l)q+r(p^q)] c 2y, ^[2{7ii-l)qc-ip-{-q)i2x-c)]![^in-i-l)qc+{p-q){2x-c)]. This, being true for all such pairs of values of x and y, is now in a form independent of any particidar point on the curve we seek ; in other M'ords. it may be taken as the differential equation of the curve, and it is evidently of the type dy ia—x) dx {p-\-yx) where a, /3, y involve only p. q, n, etc., the constants of the distri- bution we set out to fit. -' y ■'yx-\-h 186 STATISTICS The equation is simplified if we transfer the origin to the point (a, 0), when it becomes dy^ yx _^ • dx yx-\-h where 8=j3+ya. To integrate, separate the variables as before : {dy y Therefore,. 
$$\log y + \frac{1}{\gamma}\int\frac{\gamma x + \delta - \delta}{\gamma x + \delta}\,dx = 0,$$
$$\log y + \frac{x}{\gamma} - \frac{\delta}{\gamma^2}\log(\gamma x + \delta) + A = 0,$$
where A is a constant, or
$$y = B\,e^{-x/\gamma}(\gamma x + \delta)^{\delta/\gamma^2},$$
where B is a constant. It may be written
$$y = y_0\,e^{-kx}\left(1 + \frac{x}{a}\right)^{ka}, \qquad (2)$$
where $k = 1/\gamma$, $a = \delta/\gamma$, and $y_0$ is a new constant.

This, then, may prove a suitable type of curve to fit a set of statistics forming a skew frequency distribution, but the question now arises whether equations (1) and (2) are the most general types possible. Clearly (1) is only a particular case of (2) obtained by making p = q, and, this being so, (2) may itself be a particular case of some still more general type. Light may be thrown on this if we consider the geometrical bearing of the differential equation obtained in the last case:
$$\frac{dy}{dx} = \frac{y(\alpha - x)}{\beta + \gamma x}. \qquad (3)$$
The presence of y and $(\alpha - x)$ in the numerator of the right-hand side of (3) shows that $dy/dx$ vanishes when y = 0 and when $x = \alpha$, i.e. the curve touches the axis of x where the two meet and there is a maximum point on the curve at $x = \alpha$. (Since $\alpha$ is the particular value of the organ or character x for which the frequency is a maximum, $\alpha$ is of course the mode.) Now these two characteristics are the very ones to which we wished to give symbolical expression since they serve to describe in broad outline what was agreed to be the trend of the majority of frequency distributions: the rise from zero to a maximum, at first gradually, then faster, and, after passing through the maximum, the fall to zero again, generally at a different rate.

As to the denominator of equation (3), the corresponding equation for type (1), before the origin was changed, was similar to equation (3), except that it contained no x term in the denominator, and that is readily understood when we note that $\gamma$ is a multiple of (p - q) and thus vanishes when p = q.
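The integration leading to equation (2) may be checked numerically by comparing the slope of the curve, found by finite differences, with the right-hand side of the differential equation $dy/dx = -yx/(\gamma x + \delta)$ from which it was derived. The sketch below is illustrative only: the values $\gamma = 0.5$, $\delta = 2$ and $y_0 = 1$ are arbitrary assumptions, not constants fitted to any data in the text, and the function names are introduced here for convenience.

```python
import math

GAMMA, DELTA, Y0 = 0.5, 2.0, 1.0   # arbitrary illustrative constants (assumed)

def skew_curve(x):
    """Equation (2): y = y0 * exp(-k x) * (1 + x/a)**(k a), k = 1/gamma, a = delta/gamma."""
    k, a = 1.0 / GAMMA, DELTA / GAMMA
    return Y0 * math.exp(-k * x) * (1.0 + x / a) ** (k * a)

def slope_from_equation(x):
    """Right-hand side of the differential equation dy/dx = -y x / (gamma x + delta)."""
    return -skew_curve(x) * x / (GAMMA * x + DELTA)

def slope_numerical(x, h=1e-6):
    """Central-difference estimate of dy/dx on the curve itself."""
    return (skew_curve(x + h) - skew_curve(x - h)) / (2.0 * h)

# The curve is defined for x > -a = -4 with these constants.
for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    print(f"x = {x:5.1f}  numerical {slope_numerical(x):+.6f}  equation {slope_from_equation(x):+.6f}")
```

The two columns should agree to within the accuracy of the finite difference, and the slope vanishes at x = 0, the mode, as the discussion of equation (3) requires.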
Now, if from (3) we get a less general type of curve by dropping the x term in the denominator, we may perhaps get a more general type by adding an $x^2$ term, and even an $x^3$ term, an $x^4$ term, and so on. In fact there seems no reason why the denominator should not be any function of x, say f(x), which, however, we shall suppose for simplicity capable of expansion in a Maclaurin's series of ascending powers of x which converges quickly. We are led to propose, therefore, as more general than (3), the differential equation
$$\frac{dy}{dx} = \frac{y(x+b)}{px^2+qx+r}. \qquad (4)$$
We stop at $x^2$ in the denominator because it has been found, if we may anticipate results to save needless labour, that beyond this point the heaviness of the calculation involved and the decreasing accuracy of the higher moments that have to be introduced outweigh any other advantage gained.

The curve or set of curves resulting from the integration of equation (4) is known as Karl Pearson's Generalized Probability Curve, and their author has stated that, while it comprises the two other types as special cases, it practically covers all homogeneous statistics he has had to deal with.

Just as the differential equations in the first two cases considered were related respectively to the symmetrical and the skew binomial expansions, so is equation (4) related to the hypergeometrical expansion, the successive terms of which express the probability that r black balls, (r - 1) black balls and 1 white ball, (r - 2) black balls and 2 white balls, . . ., r white balls, will be drawn from a bag containing pn black balls and qn white ones, where (p + q) = 1, when r balls are drawn in all, none being replaced before the next is drawn.

If the terms of this expansion are represented by ordinates of which the summits determine a polygon as in the binomial cases, the corresponding expression for the gradient of the curve at any point is given by an equation of type (4).
We need not go over the detailed proof of this statement since it follows precisely the same lines as in the previous cases.

The method of integration of the equation

$$\frac{dy}{dx}=\frac{y(x+b)}{px^{2}+qx+r}$$

depends upon the nature of the roots of the quadratic in the denominator, which may be written

$$px^{2}+qx+r=p\left[\left(x+\frac{q}{2p}\right)^{2}-\frac{q^{2}}{4p^{2}}+\frac{r}{p}\right]
=p\left[\left(x+\frac{q}{2p}\right)^{2}-\frac{4r^{2}}{q^{2}}\cdot\frac{q^{2}}{4pr}\left(\frac{q^{2}}{4pr}-1\right)\right]
=p\left[\left(x+\frac{q}{2p}\right)^{2}-\frac{4r^{2}}{q^{2}}\,\kappa(\kappa-1)\right],$$

where $\kappa=q^{2}/4pr$, and it is evident that the quadratic splits up into real factors if $\kappa(\kappa-1)$ is positive. This is the case when $\kappa$ has any negative value, or when it is positive and greater than 1, the truth of which may be seen more effectively if the curve $y=\kappa(\kappa-1)$, which is a parabola symmetrical about the line $\kappa=\tfrac12$, be drawn, fig. (29), by plotting $y$ against $\kappa$.

[Fig. (29): the parabola $y=\kappa(\kappa-1)$ plotted against $\kappa$.]

Further, the product of the roots of the quadratic $px^{2}+qx+r=0$ is

$$\frac{r}{p}=\frac{4r^{2}}{q^{2}}\cdot\frac{q^{2}}{4pr}=\frac{4r^{2}}{q^{2}}\,\kappa,$$

so that the roots when real will be of the same sign if $\kappa$ is positive and of opposite signs if $\kappa$ is negative. The boundary lines $\kappa=0$ and $\kappa=1$ thus divide the whole field into three parts, as shown in fig. (30), in one of which the roots are real and of opposite sign, in the next the roots are imaginary, and in the third the roots are real and of the same sign. At the boundaries we get particular cases as follows:

$\kappa=0$: this requires $q=0$, since $\kappa=q^{2}/4pr$, which makes the roots of the quadratic equal but of opposite sign, unless $p=0$ also, and in that case both roots are infinite;

$\kappa=1$: the roots are real and equal and of the same sign;

$\kappa=\infty$: this requires $p=0$ or $r=0$; in the former case one root of the quadratic is infinite, and in the latter one root is zero.

[Fig. (30): the regions of $\kappa$ separated by the boundary lines $\kappa=0$ and $\kappa=1$.]
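The division of the field by the criterion may be sketched in a few lines of Python. This is only an illustration of the rule just stated, not part of the original treatment; the function name `classify_roots` and the wording of the returned labels are assumptions made for the example.

```python
import math


def classify_roots(p, q, r):
    """Classify the roots of p*x**2 + q*x + r = 0 by kappa = q**2 / (4*p*r).

    Hypothetical helper: the labels follow the cases listed in the text.
    """
    if p == 0 and q == 0:
        # kappa = 0 with p = 0 as well: both roots infinite
        return 0.0, "both roots infinite"
    if p == 0 or r == 0:
        # kappa infinite: one root infinite (p = 0) or one root zero (r = 0)
        return math.inf, "one root infinite or zero"
    kappa = q * q / (4.0 * p * r)
    if kappa < 0:
        return kappa, "real, opposite sign"
    if kappa == 0:
        return kappa, "equal, opposite sign"
    if kappa < 1:
        return kappa, "imaginary"
    if kappa == 1:
        return kappa, "real and equal"
    return kappa, "real, same sign"


if __name__ == "__main__":
    for coeffs in [(1, 1, -2), (1, 1, 1), (1, 2, 1), (1, 3, 2)]:
        print(coeffs, classify_roots(*coeffs))
```

For instance, $x^{2}+x-2=(x+2)(x-1)$ gives $\kappa=-\tfrac18$ and roots of opposite sign, while $x^{2}+3x+2=(x+1)(x+2)$ gives $\kappa=\tfrac98$ and roots of the same sign.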
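Thus, returning to the differential equation, the curves which result from the integration

$$\int\frac{dy}{y}=\int\frac{(x+b)\,dx}{px^{2}+qx+r}$$

are of different types according to the value of $\kappa$, which is therefore called the criterion.

Type I. $\kappa$ negative. Roots of $px^{2}+qx+r=0$ real and of opposite sign.

In this case we may write $px^{2}+qx+r=p(x+\alpha')(x-\beta')$ and so get

$$\int\frac{dy}{y}+\int\frac{(x+b)\,dx}{p(\alpha'+x)(\beta'-x)}=0,$$

or, transferring the origin to the point $(-b, 0)$, the mode, we have

$$\int\frac{dy}{y}+\int\frac{x\,dx}{p(\alpha'-b+x)(\beta'+b-x)}=0,\quad\text{or}\quad\int\frac{dy}{y}+\int\frac{x\,dx}{p(\alpha+x)(\beta-x)}=0,$$

where $\alpha=\alpha'-b$, $\beta=\beta'+b$. Therefore,

$$\log y-\frac{1}{p}\int\frac{\alpha}{\alpha+\beta}\cdot\frac{dx}{\alpha+x}+\frac{1}{p}\int\frac{\beta}{\alpha+\beta}\cdot\frac{dx}{\beta-x}+A=0,$$

where $A$ is a constant. Thus

$$\log y=\frac{1}{p(\alpha+\beta)}\left[\alpha\log(\alpha+x)+\beta\log(\beta-x)\right]+\log B,$$

where $B$ is a constant, whence

$$y=B(\alpha+x)^{\alpha/p(\alpha+\beta)}(\beta-x)^{\beta/p(\alpha+\beta)},$$

i.e.

$$y=y_0\left(1+\frac{x}{\alpha}\right)^{\nu\alpha}\left(1-\frac{x}{\beta}\right)^{\nu\beta}\qquad\ldots\ (5)$$

where $\nu=1/p(\alpha+\beta)$ and $y_0$ is a new constant. This is a skew curve of limited range, bounded by the lines $x=-\alpha$ and $x=+\beta$, with the mode at the origin.

Type II. $\kappa=0$. $q=0$, but not $p=0$. Roots of $px^{2}+qx+r=0$ equal and of opposite sign.

This curve is just a particular case of type I., which reduces to

$$y=y_0\left(1-\frac{x^{2}}{\alpha^{2}}\right)^{\nu\alpha}\qquad\ldots\ (6)$$

symmetrical about the axis of $y$ (because for any value of $y$ there are two values of $x$, equal and of opposite sign) and of limited range bounded by $x=-\alpha$ and $x=+\alpha$, with the mode at the origin.

Type III. $\kappa=\infty$.* $p=0$, but not $r=0$. One root of $px^{2}+qx+r=0$ infinite.

This is the skew binomial case over again. It may be also deduced from type I. by making one root, say $\beta'$, tend to infinity. The curve then takes the form

$$y=y_0\left(1+\frac{x}{\alpha}\right)^{\nu\alpha}\lim_{\beta\to\infty}\left(1-\frac{x}{\beta}\right)^{\nu\beta},$$

because $\beta=\beta'+b$, so that $\beta$ tends to infinity with $\beta'$. Hence, since

$$\lim_{\beta\to\infty}\left(1-\frac{x}{\beta}\right)^{\nu\beta}=\lim_{\lambda\to\infty}\left[\left(1+\frac{1}{\lambda}\right)^{\lambda}\right]^{-\nu x}=e^{-\nu x},$$

where $\lambda=-\beta/x$, we have

$$y=y_0\left(1+\frac{x}{\alpha}\right)^{\nu\alpha}e^{-\nu x},\qquad\ldots\ (7)$$

a skew curve limited in one direction by the line $x=-\alpha$, with the mode at the origin.

[* Although theoretically this type corresponds to an infinite value for $\kappa$, in practice it will as a rule give a reasonable fit provided $\kappa$ is numerically greater than 4. (See W. P. Elderton's Frequency Curves and Correlation, p. 50.)]

Type IV. $\kappa$ positive and less than 1. Roots of $px^{2}+qx+r=0$ imaginary.

Put $\kappa(\kappa-1)=-\lambda^{2}$, and the differential equation then leads to

$$\int\frac{dy}{y}=\int\frac{(x+b)\,dx}{p\left[\left(x+\dfrac{q}{2p}\right)^{2}+\dfrac{4r^{2}\lambda^{2}}{q^{2}}\right]}.$$

Transfer the origin to the point $\left(-\dfrac{q}{2p}, 0\right)$ and this becomes

$$\int\frac{dy}{y}=\int\frac{\left(x+b-\dfrac{q}{2p}\right)dx}{p\left(x^{2}+\dfrac{4r^{2}\lambda^{2}}{q^{2}}\right)},$$

$$\log y=A+\frac{1}{2p}\log\left(x^{2}+\frac{4r^{2}\lambda^{2}}{q^{2}}\right)+\frac{1}{p}\left(b-\frac{q}{2p}\right)\frac{q}{2r\lambda}\tan^{-1}\frac{qx}{2r\lambda},$$

where $A$ is a constant. Therefore,

$$y=y_0\left(1+\frac{x^{2}}{a^{2}}\right)^{-m}e^{-\nu\tan^{-1}(x/a)}\qquad\ldots\ (8)$$

where $a=\dfrac{2r\lambda}{q}$, $m=-\dfrac{1}{2p}$, $\nu=-\dfrac{1}{ap}\left(b-\dfrac{q}{2p}\right)$, and $y_0$ is a constant. This is a skew curve of unlimited range in both directions. The position of the mode is found by putting $dy/dx=0$ in (8) after differentiation, or, what comes to the same thing, is seen by direct reference to the differential equation itself. Thus the distance of the mode from the origin $=-\left(b-\dfrac{q}{2p}\right)=-\dfrac{\nu a}{2m}$.

Type V. $\kappa=1$. Roots of $px^{2}+qx+r=0$ real and equal.

The equation to integrate becomes

$$\int\frac{dy}{y}=\int\frac{(x+b)\,dx}{p\left(x+\dfrac{q}{2p}\right)^{2}}.$$

Transfer the origin to the point $\left(-\dfrac{q}{2p}, 0\right)$, and this becomes

$$\int\frac{dy}{y}=\int\frac{\left(x+b-\dfrac{q}{2p}\right)dx}{px^{2}},\qquad
\log y=A+\frac{1}{p}\log x-\frac{1}{p}\left(b-\frac{q}{2p}\right)\frac{1}{x},$$

where $A$ is a constant. Therefore, $y=y_0x^{1/p}e^{-(b-q/2p)/px}$, i.e.

$$y=y_0x^{-s}e^{-\gamma/x},\qquad\ldots\ (9)$$

where $s=-1/p$, $\gamma=\dfrac{1}{p}\left(b-\dfrac{q}{2p}\right)$, and $y_0$ is a constant. Here $x$ cannot become negative, so that the curve is skew and limited in one direction. The distance of the mode from the origin $=-\left(b-\dfrac{q}{2p}\right)=\gamma/s$.

Type VI. $\kappa$ positive and greater than 1. Roots of $px^{2}+qx+r=0$ real and of the same sign.

Equation becomes

$$\int\frac{dy}{y}=\int\frac{(x+b)\,dx}{p(x+\alpha)(x+\beta)},$$

$$\log y=\int\left[\frac{(b-\alpha)}{p(\beta-\alpha)}\cdot\frac{1}{x+\alpha}+\frac{(b-\beta)}{p(\alpha-\beta)}\cdot\frac{1}{x+\beta}\right]dx
=A+\frac{1}{p(\beta-\alpha)}\left[(b-\alpha)\log(x+\alpha)-(b-\beta)\log(x+\beta)\right],$$

where $A$ is a constant; or, transferring the origin to $(-\beta, 0)$,

$$\log y=A+\frac{1}{p(\beta-\alpha)}\left[(b-\alpha)\log\{x-(\beta-\alpha)\}-(b-\beta)\log x\right],$$

$$y=y_0(x-a)^{q_2}x^{-q_1},\qquad\ldots\ (10)$$

where $a=\beta-\alpha$, $q_2=(b-\alpha)/p(\beta-\alpha)$, $q_1=(b-\beta)/p(\beta-\alpha)$, and $y_0$ is a constant. This is a skew curve bounded by $x=a$ in one direction. The distance of the mode from the origin $=-(b-\beta)=aq_1/(q_1-q_2)$.

Type VII. $\kappa=0$, $q=0$, $p=0$. Roots of the quadratic $px^{2}+qx+r=0$ both infinite.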
This is the symmetrical binomial case over again and the integration reduces to

$$\int\frac{dy}{y}=\int\frac{(x+b)\,dx}{r},$$

or, transferring the origin to $(-b, 0)$,

$$\int\frac{dy}{y}=\frac{1}{r}\int x\,dx,\qquad\log y=A+\frac{x^{2}}{2r},$$

where $A$ is a constant. Therefore

$$y=y_0e^{-x^{2}/2\sigma^{2}}\qquad\ldots\ (11)$$

where $y_0$ is a constant and $\sigma^{2}=-r$. This curve, the normal curve of error, is symmetrical about the axis of $y$, where mean and mode coincide, and it is of unlimited range on either side of it.

CHAPTER XVI

CURVE FITTING (continued): THE METHOD OF MOMENTS FOR CONNECTING CURVE AND STATISTICS

We have now completed the first stage of the discussion upon which we embarked: we have found by the application of general principles various types of curve, represented by different equations, which are said to fit more or less satisfactorily a considerable number at all events of frequency distributions composed of homogeneous material.

Our next task is to pass from the general to the particular, to see how to set up a connection between an actually observed frequency distribution and the appropriate theoretical curve. This again seems to break up into two parts: (1) to find a way of deciding which type of curve to adopt in a particular case; (2) to determine the constants of the curve in terms of the observed statistics; but since the criterion, $\kappa$, which distinguishes one type of curve from another is itself a function of the constants of the curve before integration, it follows that the solution of the first part is incidental to that of the second.

The general method proposed for determination of the constants of the curve in terms of the observed statistics is the now well-known method of moments due to Karl Pearson, whereby the area and moments of the fitting curve are equated to the area and moments, calculated from the statistics, of the observation curve.
If a frequency table be drawn up (see Table (40)) showing the number $f$ of observations corresponding to the deviation $x$ of each value, or group mid-value, $X$ of the character observed from some fixed value, the expression

$$x_1f_1+x_2f_2+\ldots+x_rf_r+\ldots$$

is called the first moment of the distribution with reference to the fixed value, which may be termed the origin. Similarly,

$$x_1^{2}f_1+x_2^{2}f_2+\ldots+x_r^{2}f_r+\ldots$$

is called the second moment, $\Sigma x^{3}f$ the third moment, $\Sigma x^{4}f$ the fourth moment, and so on. The following notation will be found convenient for working purposes:

$$\nu'_1=\frac{\Sigma xf}{\Sigma f}=\frac{N'_1}{N},\qquad\nu'_2=\frac{\Sigma x^{2}f}{\Sigma f}=\frac{N'_2}{N},\qquad\ldots$$

Undashed letters are reserved for use when the distribution is referred to its mean as origin, in other words when the deviations of the $X$'s are measured from the mean $X$.

Table (40).

Deviation   Frequency   First Moment   Second Moment   Third Moment   Fourth Moment
x_1         f_1         x_1 f_1        x_1^2 f_1       x_1^3 f_1      x_1^4 f_1
x_2         f_2         x_2 f_2        x_2^2 f_2       x_2^3 f_2      x_2^4 f_2
. . .       . . .       . . .          . . .           . . .          . . .
x_r         f_r         x_r f_r        x_r^2 f_r       x_r^3 f_r      x_r^4 f_r
. . .       . . .       . . .          . . .           . . .          . . .
Totals      N           N'_1           N'_2            N'_3           N'_4

Now each $N$ in the frequency table is the sum of a number of discrete quantities which only tend to form a continuous series as the class intervals are made very small and the number of observations is made very large. The corresponding frequency polygon or histogram, if we drew it, would at the same time tend to become a continuous curve, the observation curve. If that limiting stage were attainable, if we could actually get an infinitely large sample of observations in which the character observed changed by infinitesimal amounts, we could then replace the isolated $f$'s of observation by the corresponding $y'$'s, the ordinates of this observation curve, and to get the moments we could write instead of the discrete sums

$$\Sigma f,\quad\Sigma xf,\quad\Sigma x^{2}f,\ \ldots,$$

the continuous integral expressions

$$\int y'\,dx,\quad\int xy'\,dx,\quad\int x^{2}y'\,dx,\ \ldots,$$

taking in the whole sweep of the curve by integrating throughout the range of deviation $x$.
We should then have, if areas and moments are equated according to Pearson's method,

$$\int y\,dx=\int y'\,dx,\quad\int xy\,dx=\int xy'\,dx,\quad\int x^{2}y\,dx=\int x^{2}y'\,dx,\ \ldots,\ \int x^{n}y\,dx=\int x^{n}y'\,dx,$$

where $y$ is the ordinate of the fitting curve corresponding to the ordinate $y'$ of the observation curve.

In practice, however, it is impossible to go to this limit: we cannot deal with an infinitely large sample, so we take as large a sample as is convenient, calculate the rough moments, $N$, $N'_1$, $N'_2$, . . ., and find approximately what corrections or adjustments are necessary to obtain the moments of the observation curve, a procedure which is really equivalent to the determination of the area of a curve when only a number of isolated points thereon are known.

For the full analytical justification of the method of moments the reader is referred to Professor Pearson's original paper, On the Systematic Fitting of Curves to Observations and Measurements [Biometrika, vol. i., pp. 265 et seq.; also vol. ii., pp. 1-23], where it is shown that 'with due precautions as to quadrature, it gives, when one can make a comparison, sensibly as good results as the method of least squares.' The latter, which is the traditional way of approaching all such problems, is shown to be impracticable in a large number of cases, either because the resulting equations cannot be solved, or, when they are capable of solution, because the labour involved would be colossal.

Let us consider next how to deduce the area and moments of the observation curve from the statistics, in other words how to get

$$\int y'\,dx,\quad\int xy'\,dx,\quad\int x^{2}y'\,dx,\ \ldots,$$

the integrals being taken throughout the range of the curve, when we know the frequencies corresponding to only a certain number of values or elementary ranges of the deviation $x$.

Now the character observed may be capable of the deviations actually recorded and of no values in between, e.g.
measuring deviations from 'no rooms' as origin, we might have $f_1$ one-roomed tenements, $f_2$ two-roomed tenements, $f_3$ three-roomed tenements, but there could be no such thing as a two-and-a-half or a three-and-a-quarter-roomed tenement; on the other hand, any recorded deviation, $X_r$, may be only the mid-value (used as a convenient and concise approximation) of a group of observations including all in the continuous range from $(X_r-\tfrac12)$ to $(X_r+\tfrac12)$, where unit deviation is the class interval: thus we might have $f_6$ males deviating by +6 in. from 5 ft. (comprising all the males observed between 5 ft. 5½ in. and 5 ft. 6½ in.), $f_5$ males deviating by +5 in. from 5 ft. (comprising all males between 5 ft. 4½ in. and 5 ft. 5½ in.), and so on. These two cases must be discussed separately.

(1) When the observations are centred at definite but isolated values of $x$.

The problem is to find $\int x^{n}y'\,dx$ (the $n$th moment) when we have no definite curve given but we know the values of $x$ and $y'$ at a number of isolated points, say $(x_0, y'_0)$, $(x_1, y'_1)$, $(x_2, y'_2)$, . . ., $(x_p, y'_p)$. This is equivalent to discovering a suitable 'quadrature formula,' i.e. a good approximation to $\int z\,dx$ in terms of known points $(x_0, z_0)$, $(x_1, z_1)$, $(x_2, z_2)$, . . ., $(x_p, z_p)$, where we have written $z$ in place of $x^{n}y'$, and we may generally take the ordinates to be at equal distances, $h$, apart.

[Fig. (31): ordinates situated at the ends of the $h$ intervals. Fig. (32): ordinates situated at the centres of the $h$ intervals.]

Several such formulae have been suggested and they vary according as the $z$'s are situated at the ends (fig. (31)) or at the centres (fig. (32)) of the $h$ intervals. The second type is perhaps the more useful of the two, and we shall work out one formula in illustration of it.

Consider the first five of the given points, namely, $(x_0, z_0)$, $(x_1, z_1)$, . . ., $(x_4, z_4)$. As a simple 'curve of closest contact' let us find the parabola of type

$$z=c_0+c_1x/h+c_2x^{2}/h^{2}+c_3x^{3}/h^{3}+c_4x^{4}/h^{4}\qquad\ldots\ (1)$$

which goes through these five points, where the $c$'s are constants to be determined. We may without loss of generality take the axis of $z$ to coincide with the middle one of the five ordinates, so that the known points on the curve become $(-2h, z_0)$, $(-h, z_1)$, $(0, z_2)$, $(+h, z_3)$, $(+2h, z_4)$, and on substitution in (1) we get

$$\begin{aligned}
z_0&=c_0-2c_1+4c_2-8c_3+16c_4,\\
z_1&=c_0-c_1+c_2-c_3+c_4,\\
z_2&=c_0,\\
z_3&=c_0+c_1+c_2+c_3+c_4,\\
z_4&=c_0+2c_1+4c_2+8c_3+16c_4.
\end{aligned}$$

These equations are just sufficient uniquely to determine the $c$'s, and hence the parabolic curve of closest contact, in terms of the five given points, but for our purpose it is not necessary to find all the $c$'s.

Suppose our object is to find the area of the shaded portion of fig. (33) in terms of the co-ordinates of the five given points.

[Fig. (33): the curve through the five points, with the area between $x=-h/2$ and $x=+h/2$ shaded.]

This area

$$=\int_{-h/2}^{+h/2}z\,dx=\int_{-h/2}^{+h/2}\left(c_0+c_1x/h+c_2x^{2}/h^{2}+c_3x^{3}/h^{3}+c_4x^{4}/h^{4}\right)dx$$

$$=\Big[c_0x+c_1x^{2}/2h+c_2x^{3}/3h^{2}+c_3x^{4}/4h^{3}+c_4x^{5}/5h^{4}\Big]_{-h/2}^{+h/2}=h\left[c_0+\tfrac{1}{12}c_2+\tfrac{1}{80}c_4\right].$$

But the equations between the $z$'s and $c$'s at once give

$$z_2=c_0,\qquad z_0+z_4=2(c_0+4c_2+16c_4),\qquad z_1+z_3=2(c_0+c_2+c_4).$$

Thus

$$8c_2+32c_4=(z_0+z_4)-2z_2,\qquad 2c_2+2c_4=(z_1+z_3)-2z_2.$$

Therefore

$$24c_2=16(z_1+z_3)-(z_0+z_4)-30z_2,\qquad 24c_4=(z_0+z_4)-4(z_1+z_3)+6z_2.$$

Hence, by substitution, the shaded area becomes

$$\int_{-h/2}^{+h/2}z\,dx=h\left[z_2+\tfrac{1}{288}\{16(z_1+z_3)-(z_0+z_4)-30z_2\}+\tfrac{1}{1920}\{(z_0+z_4)-4(z_1+z_3)+6z_2\}\right]$$

$$=\frac{h}{5760}\left[5178z_2-17(z_0+z_4)+308(z_1+z_3)\right],\qquad\ldots\ (2)$$

these particular ordinates being appropriate when the axis of $z$ coincides with the $z_2$ ordinate. Similarly, it can be shown that

$$\int_{-h/2}^{+3h/2}z\,dx=\frac{h}{24}\left[27z_0+17z_1+5z_2-z_3\right]\qquad\ldots\ (3)$$

by finding the parabolic curve of closest contact through $(0, z_0)$, $(h, z_1)$, $(2h, z_2)$, $(3h, z_3)$, the axis of $z$ coinciding now with $z_0$.

Now we require $\int_{-h/2}^{(p+\frac12)h}z\,dx$ (see fig. (32)), and this may be obtained by splitting up the integral thus

$$\int_{-h/2}^{3h/2}+\int_{3h/2}^{5h/2}+\int_{5h/2}^{7h/2}+\ldots+\int_{(p-\frac52)h}^{(p-\frac32)h}+\int_{(p-\frac32)h}^{(p+\frac12)h}$$

and applying the formulae (2) and (3) to evaluate these sub-integrals. The first and last come under head (3), while all the rest come under (2).
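As a check on the coefficients in formulae (2) and (3), the short Python sketch below verifies with exact rational arithmetic that (2) reproduces the integral of every power of $x$ up to the fourth, and (3) every power up to the third, taking $h=1$. The function names `formula_2`, `formula_3` and `exact_integral` are illustrative choices for this sketch and not anything from the original.

```python
from fractions import Fraction as F


def exact_integral(k, lo, hi):
    """Exact integral of x**k between lo and hi."""
    return (hi ** (k + 1) - lo ** (k + 1)) / (k + 1)


def formula_2(z):
    """Formula (2) with h = 1: ordinates z0..z4 at x = -2, -1, 0, 1, 2."""
    return F(1, 5760) * (5178 * z[2] - 17 * (z[0] + z[4]) + 308 * (z[1] + z[3]))


def formula_3(z):
    """Formula (3) with h = 1: ordinates z0..z3 at x = 0, 1, 2, 3."""
    return F(1, 24) * (27 * z[0] + 17 * z[1] + 5 * z[2] - z[3])


def check():
    """Return True if both formulae are exact for the stated powers of x."""
    ok = True
    for k in range(5):  # quartic through five points
        z = [x ** k for x in (-2, -1, 0, 1, 2)]
        ok = ok and formula_2(z) == exact_integral(k, F(-1, 2), F(1, 2))
    for k in range(4):  # cubic through four points
        z = [x ** k for x in (0, 1, 2, 3)]
        ok = ok and formula_3(z) == exact_integral(k, F(-1, 2), F(3, 2))
    return ok


if __name__ == "__main__":
    print(check())
```

Exact fractions are used here rather than floating-point numbers so that the comparison is an equality and not an approximation.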
In fact, we fit together portions of curves of parabolic type based on the successive groups of points (0, 1, 2, 3), (0, 1, 2, 3, 4), (1, 2, 3, 4, 5), (2, 3, 4, 5, 6), . . . $(p-4, p-3, p-2, p-1, p)$, $(p-3, p-2, p-1, p)$, and as the points overlap, in the sense that neighbouring groups have points in common, the curves dovetail into one another and so provide a fairly good approximation to what we want in the way of integral expressions giving areas based upon the positions of certain known points.

We have, then:

$$\begin{aligned}
\int_{-h/2}^{3h/2}z\,dx&=\frac{h}{24}\left[27z_0+17z_1+5z_2-z_3\right]\\
\int_{3h/2}^{5h/2}z\,dx&=\frac{h}{5760}\left[5178z_2-17(z_0+z_4)+308(z_1+z_3)\right]\\
\int_{5h/2}^{7h/2}z\,dx&=\frac{h}{5760}\left[5178z_3-17(z_1+z_5)+308(z_2+z_4)\right]\\
\int_{7h/2}^{9h/2}z\,dx&=\frac{h}{5760}\left[5178z_4-17(z_2+z_6)+308(z_3+z_5)\right]\\
&\ \ \vdots\\
\int_{(p-\frac52)h}^{(p-\frac32)h}z\,dx&=\frac{h}{5760}\left[5178z_{p-2}-17(z_{p-4}+z_p)+308(z_{p-3}+z_{p-1})\right]
\end{aligned}$$

[. . .]

Therefore,

$$\mu_4=\nu_4-\tfrac12\nu_2+\tfrac{7}{240}\qquad\ldots\ (8)$$

To sum up, the general procedure in Case (2) is to calculate $N$, $N'_1$, $N'_2$, $N'_3$, $N'_4$ directly from the statistics and so deduce $\nu'_1$, $\nu'_2$, $\nu'_3$, $\nu'_4$. Then, transferring the origin to the mean, the $\nu'$'s become $\nu_1$, $\nu_2$, $\nu_3$, $\nu_4$ (see Appendix, Note 5), and finally the corrected $\mu$'s are given by

$$\mu_1=0,\qquad\mu_2=\nu_2-\tfrac{1}{12},\qquad\mu_3=\nu_3,\qquad\mu_4=\nu_4-\tfrac12\nu_2+\tfrac{7}{240}.$$

These adjustments, originally due to Dr. W. F. Sheppard* [Proceedings of the Lond. Mathl. Socy., vol. xxix., pp. 353 et seq.], are applicable only when the curve of distribution has high contact at each extremity, as very frequently happens. To this case we shall confine ourselves, and when it does not hold the unadjusted moments may be used as a rough approximation failing a more refined but also a more intricate adjustment.

[* To obtain Sheppard's adjustments we have followed the method indicated in Elderton's Frequency Curves and Correlation, pp. 28, 29.]

The way in which the three chief kinds of average are related to the fitting curve is of interest and deserves recapitulation.
Whether the observations are classed as in Case (1) or as in Case (2):

(1) the ordinate drawn through the highest point of the curve, since the frequency there is a maximum, fixes the modal value of $X$;

(2) the median $X$ is determined by the ordinate bisecting the area between the curve and axis, since there are an equal number of observations on either side of it; and

(3) the mean is determined by the ordinate through the centre of gravity of the area between the curve and axis.

We have still to show how to express the constants of the fitting curve in terms of the moments calculated from the given statistics, and it will be convenient now to make our approach from the other end.

Take the general equation of the fitting curve, express its constants in terms of its moments, and substitute for the latter the values determined from the statistics, since the basis of the fitting is the equalization of the moments of the observational curve and of the theoretical curve. This will enable us to determine $\kappa$, the criterion for fixing the type of curve suitable to the given distribution. When the type has been fixed it is, as a rule, not a very difficult matter to express the constants of the particular type again in terms of the observational moments.

Now the general differential equation of the fitting curve was

$$\frac{dy}{dx}=\frac{y(x+b)}{px^{2}+qx+r},$$

hence

$$\int(px^{2}+qx+r)\,dy=\int y(x+b)\,dx,$$

where the integration is to traverse the complete curve. Therefore, multiplying both sides by $x^{n}$,

$$\int(px^{n+2}+qx^{n+1}+rx^{n})\,dy=\int(yx^{n+1}+byx^{n})\,dx;$$

or, if we integrate the left-hand side by parts,

$$\left[(px^{n+2}+qx^{n+1}+rx^{n})y\right]-\int y\left\{(n+2)px^{n+1}+(n+1)qx^{n}+nrx^{n-1}\right\}dx=\int(yx^{n+1}+byx^{n})\,dx.$$

But the expression in square brackets vanishes at both limits if we suppose $y$ to be zero at each end of the curve, so that the equation reduces to

$$\{1+p(n+2)\}\int yx^{n+1}\,dx+\{b+q(n+1)\}\int yx^{n}\,dx+rn\int yx^{n-1}\,dx=0\qquad\ldots\ (9)$$

Now if deviations are measured from the mean of the distribution, we have

$$\int yx\,dx=N\mu_1=0,\qquad\int yx^{2}\,dx=N\mu_2,\qquad\int yx^{3}\,dx=N\mu_3,\ \text{etc.},$$

and therefore, putting $n=3$ in the above relation,

$$(1+5p)N\mu_4+(b+4q)N\mu_3+3rN\mu_2=0;$$

put $n=2$, $(1+4p)N\mu_3+(b+3q)N\mu_2=0$;

put $n=1$, $(1+3p)N\mu_2+rN=0$;

put $n=0$, $(b+q)N=0$.

Thus $b=-q$, and, on substitution in the other three equations, we get

$$5\mu_4p+3\mu_3q+3\mu_2r+\mu_4=0,\qquad 4\mu_3p+2\mu_2q+\mu_3=0,\qquad 3\mu_2p+r+\mu_2=0,$$

three simple linear equations to find $p$, $q$, $r$, the solution of which leads to

$$p=-(2\mu_2\mu_4-3\mu_3^{2}-6\mu_2^{3})/(10\mu_2\mu_4-12\mu_3^{2}-18\mu_2^{3}),$$

$$q=-b=-\mu_3(\mu_4+3\mu_2^{2})/(10\mu_2\mu_4-12\mu_3^{2}-18\mu_2^{3}),$$

$$r=-\mu_2(4\mu_2\mu_4-3\mu_3^{2})/(10\mu_2\mu_4-12\mu_3^{2}-18\mu_2^{3}).$$

We have thus expressed $p$, $q$, $r$, and $b$, the constants of the fitting curve, in terms of the moments of the observed distribution, but the results may be rendered more concise by writing

$$\beta_1=\mu_3^{2}/\mu_2^{3},\qquad\beta_2=\mu_4/\mu_2^{2},\qquad\ldots\ (10)$$

whence

$$p=-(2\beta_2-3\beta_1-6)/2(5\beta_2-6\beta_1-9),\qquad\ldots\ (11)$$

$$q=-b=-\sqrt{(\mu_2\beta_1)}\cdot(\beta_2+3)/2(5\beta_2-6\beta_1-9),\qquad\ldots\ (12)$$

$$r=-\mu_2(4\beta_2-3\beta_1)/2(5\beta_2-6\beta_1-9).\qquad\ldots\ (13)$$

And $\kappa$, the criterion for fixing the type of curve suitable to the statistics given, is immediately deduced from

$$\kappa=q^{2}/4pr=\beta_1(\beta_2+3)^{2}/4(4\beta_2-3\beta_1)(2\beta_2-3\beta_1-6)\qquad\ldots\ (14)$$

Also, since $dy/dx$ vanishes when $x=-b$, this fixes the mode relative to the origin. But the origin is now at the mean, so that

$$\text{mode}-\text{mean}=-b=-\sqrt{(\mu_2\beta_1)}\cdot(\beta_2+3)/2(5\beta_2-6\beta_1-9)\qquad\ldots\ (15)$$

And

$$\text{skewness}=(\text{mean}-\text{mode})/\text{S.D.}=b/\sqrt{\mu_2}=\sqrt{\beta_1}\,(\beta_2+3)/2(5\beta_2-6\beta_1-9)\qquad\ldots\ (16)$$

CHAPTER XVII

APPLICATIONS OF CURVE FITTING

We are in a position now to test the application of these principles to given frequency distributions and we shall start by trying to find a curve to fit the record of marks obtained by 514 candidates in a certain examination (see p. 25).

Example (1). This example is chosen because it turns out, when we come to evaluate $\kappa$, that it is well fitted by the normal curve, Type VII, which is one of the simplest and at the same time the most important of all the types discussed.
Before we start the numerical part of the work it will be well to express the constants $y_0$ and $\sigma$ of this curve in terms of the moments of the distribution. The equation of the normal curve is

$$y=y_0e^{-x^{2}/2\sigma^{2}}.$$

If $N$ be the total frequency we have by equation (4) bis, p. 202,

$$N=\int_{-\infty}^{+\infty}y\,dx=y_0\int_{-\infty}^{+\infty}e^{-x^{2}/2\sigma^{2}}\,dx.$$

Put $x^{2}/2\sigma^{2}=\xi^{2}$, so that $dx/d\xi=\sigma\sqrt2$ and when $x=\infty$, $\xi=\infty$ also. Thus

$$N=y_0\sigma\sqrt2\int_{-\infty}^{+\infty}e^{-\xi^{2}}\,d\xi=y_0\sigma\sqrt2\sqrt\pi\ \text{(see Appendix, Note 8)}=\sqrt{(2\pi)}\,\sigma y_0\qquad\ldots\ (1)$$

Again

$$\mu_2=\int_{-\infty}^{+\infty}yx^{2}\,dx\Big/\int_{-\infty}^{+\infty}y\,dx
=\frac{y_0}{N}\left\{\left[-\sigma^{2}xe^{-x^{2}/2\sigma^{2}}\right]+\sigma^{2}\int_{-\infty}^{+\infty}e^{-x^{2}/2\sigma^{2}}\,dx\right\}
=\frac{y_0\sigma^{2}}{N}\int_{-\infty}^{+\infty}e^{-x^{2}/2\sigma^{2}}\,dx,$$

since $xe^{-x^{2}/2\sigma^{2}}$ vanishes at both limits. Therefore, $\mu_2=\sqrt2\cdot\sigma y_0\sqrt\pi\cdot\sigma^{2}/N=\sigma^{2}$, by (1). In fact, $\sigma$ is simply the S.D. of the distribution. And $y_0=N/\sqrt{(2\pi)}\cdot\sigma$.

Table (41). Distribution of Marks obtained by 514 Candidates in a certain Examination.

Mean No.    Deviation   Frequency of   First     Second    Third     Fourth
of Marks    from 33     Candidates     Moment    Moment    Moment    Moment
(X)         (x)         (f)            (fx)      (fx^2)    (fx^3)    (fx^4)
 3          -6            5             -30       180      -1080      6480
 8          -5            9             -45       225      -1125      5625
13          -4           28            -112       448      -1792      7168
18          -3           49            -147       441      -1323      3969
23          -2           58            -116       232       -464       928
28          -1           82             -82        82        -82        82
33           0           87               0         0          0         0
38          +1           79             +79        79        +79        79
43          +2           50            +100       200       +400       800
48          +3           37            +111       333       +999      2997
53          +4           21             +84       336      +1344      5376
58          +5            6             +30       150       +750      3750
63          +6            3             +18       108       +648      3888
Totals                  514            -110      2814      -1646     41142

The first 4 moments referred to 33 as origin and with the class interval, 5 marks, as unit of deviation, are

$$-110/514,\quad 2814/514,\quad -1646/514,\quad 41142/514.$$

The arithmetic mean of the distribution

$$=33+5\bar x=33+5\left(-\tfrac{110}{514}\right)=33-5(0.214008)=31.92996.$$

The second, third, and fourth moments referred to the mean as origin, and retaining five marks as unit of deviation, are given (see Appendix, Note 5) by

$$\nu_2=2814/514-\bar x^{2}=5.42891,$$

$$\nu_3=-1646/514-3\bar x\nu'_2+2\bar x^{3}=0.29296,$$

$$\nu_4=41142/514-4\bar x\nu'_3+6\bar x^{2}\nu'_2-3\bar x^{4}=78.7996.$$

After making Sheppard's adjustments these become

$$\mu_2=5.34558,\qquad\mu_3=0.29296,\qquad\mu_4=76.11436.$$

Thus

$$\beta_1=\mu_3^{2}/\mu_2^{3}=0.00056,\qquad\beta_2=\mu_4/\mu_2^{2}=2.66365.$$
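The arithmetic of Table (41) can be retraced with a short Python sketch. It assumes the standard relations between moments about an arbitrary origin and moments about the mean (the relations the text refers to Appendix, Note 5) together with Sheppard's adjustments as quoted earlier; the function names are hypothetical and chosen only for this illustration.

```python
# Deviations (in class intervals of 5 marks, origin at 33) and frequencies
# as given in Table (41).
deviations = list(range(-6, 7))
freqs = [5, 9, 28, 49, 58, 82, 87, 79, 50, 37, 21, 6, 3]


def raw_moments(x, f, order=4):
    """Moments nu'_1 .. nu'_order about the working origin."""
    n = sum(f)
    return [sum(fi * xi ** k for xi, fi in zip(x, f)) / n
            for k in range(1, order + 1)]


def central_from_raw(v1, v2, v3, v4):
    """Transfer the origin to the mean (standard relations)."""
    n2 = v2 - v1 ** 2
    n3 = v3 - 3 * v1 * v2 + 2 * v1 ** 3
    n4 = v4 - 4 * v1 * v3 + 6 * v1 ** 2 * v2 - 3 * v1 ** 4
    return n2, n3, n4


def sheppard(n2, n3, n4):
    """Sheppard's adjustments for unit class interval."""
    return n2 - 1 / 12, n3, n4 - n2 / 2 + 7 / 240


v1, v2, v3, v4 = raw_moments(deviations, freqs)
mean_marks = 33 + 5 * v1
mu2, mu3, mu4 = sheppard(*central_from_raw(v1, v2, v3, v4))
beta1 = mu3 ** 2 / mu2 ** 3
beta2 = mu4 / mu2 ** 2

if __name__ == "__main__":
    print(mean_marks, mu2, mu3, mu4, beta1, beta2)
```

The printed values may be compared with those worked out by hand in the text; small differences in the last decimal place are to be expected from rounding.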
Hence

$$\kappa=\beta_1(\beta_2+3)^{2}/4(4\beta_2-3\beta_1)(2\beta_2-3\beta_1-6)=(0.00056)(5.66365)^{2}/4(10.65292)(-0.67438)=-0.00063.$$

Since $\kappa$ and $\beta_1$ are small and $\beta_2$ does not differ greatly from 3, making $p$ and $q$ small, we may fit a normal curve to this distribution. The appropriate normal curve is

$$y=y_0e^{-x^{2}/2\sigma^{2}},$$

where $\sigma^{2}=\mu_2=5.34558$ (5 marks as unit), $y_0=N/\sqrt{(2\pi\mu_2)}=514/\sqrt{2\pi(5.34558)}=88.6903$. Hence the required curve has for its equation, writing results to three significant figures,

$$y=88.7\,e^{-x^{2}/10.7}.$$

Now the mean of the distribution is at 31.92996, where the central ordinate of the normal curve is erected, and the distance of any $x$, say $x_{33}$, from this point $=(33-31.92996)/5$ (expressed with 5 marks as unit) $=0.214008$.

Any other $x$ may be found in the same way and $y$ can then be deduced from the equation of the curve by taking logs, thus

$$\log_{10}y=\log_{10}88.6903-\frac{x^{2}}{10.69116}\log_{10}e=1.9478762-(0.0406218)x^{2}.$$

This enables us to calculate the ordinates of the normal curve and thence we could evaluate the areas by successive applications of a suitable quadrature formula. We can, however, get the areas direct by using a table of the probability integral, such as that due to Dr. W. F. Sheppard (see pp. 284, 285). In that case the corresponding abscissae have first to be expressed in terms of the standard deviation as unit, e.g.

$$x_{40.5}=40.5-31.92996=8.57004,\quad\text{and}\quad\sigma=5\sqrt{(5.34558)}=11.56025,$$

where the factor 5 is introduced because 5 marks was the unit in the calculation of $\mu_2$ (a process equivalent in effect to that previously adopted). Thus

$$x_{40.5}/\sigma=0.741336=\xi,\ \text{say}.$$

The area of the normal curve up to the abscissa $x/\sigma$ or $\xi$

$$=\int_{-\infty}^{x}y\,dx=\int_{-\infty}^{x}\frac{N}{\sqrt{2\pi}\,\sigma}e^{-x^{2}/2\sigma^{2}}\,dx=N\int_{-\infty}^{\xi}z\,d\xi=N\cdot\tfrac12(1+\alpha),$$

where $\tfrac12\alpha$ represents the area of the curve $z=\dfrac{1}{\sqrt{2\pi}}e^{-\xi^{2}/2}$ between 0 and $\xi$.

Sheppard's Tables give the values of $\tfrac12(1+\alpha)$ for different values of $\xi$, and when

$$\xi=0.74,\quad\tfrac12(1+\alpha)=0.7703500;\qquad\xi=0.75,\quad\tfrac12(1+\alpha)=0.7733726.$$

Therefore, by interpolation, when $\xi=0.741336$, $\tfrac12(1+\alpha)=0.7707538$.
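Where a table of the probability integral is not to hand, the same areas may be obtained from the error function. The sketch below is offered only as an illustration: it uses `math.erf` from the Python standard library in place of Sheppard's Tables, takes the mean and $\mu_2$ from the worked example, and the names `normal_cdf` and `freq_below` are assumptions of this sketch.

```python
import math

N = 514                         # total frequency
MEAN = 31.92996                 # mean number of marks, from the text
SIGMA = 5 * math.sqrt(5.34558)  # S.D. in marks (5 marks was the unit)


def normal_cdf(xi):
    """Area of the unit normal curve up to xi, i.e. (1 + alpha) / 2."""
    return 0.5 * (1.0 + math.erf(xi / math.sqrt(2.0)))


def freq_below(mark):
    """Normal frequency of candidates with marks below the given value."""
    return N * normal_cdf((mark - MEAN) / SIGMA)


if __name__ == "__main__":
    below_405 = freq_below(40.5)
    below_455 = freq_below(45.5)
    # The difference is the normal frequency for the group centred at 43.
    print(below_405, below_455, below_455 - below_405)
```

Since the tabular method interpolates between entries, the figures from this sketch may differ from the tabulated ones in the later decimal places.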
Thus the frequency of candidates with marks lying between 0 and 40.5 $=514(0.7707538)=396.17$. Similarly the frequency of candidates with marks lying between 0 and 45.5 $=452.20$.

[Fig. (35): histogram of the observations with the fitted normal curve; horizontal axis, marks obtained.]

Hence the normal frequency for the group with 43 as mean number of marks $=56.0$, and the same method gives the area for any other group. The histogram of the observations and the curve plotted from the ordinates are shown together in fig. (35).

In Table (42) are set out the calculated normal frequency (col. (4)) for each group alongside the corresponding observed frequency (col. (2)), and the differences between the two are shown in col. (5). We want to know whether the fit is a good one.

Table (42). Comparison of Observed and Normal Frequencies in Examination Example.

(1)         (2)         (3)          (4)         (5)         (6)         (7)
Mean No.    Observed    Normal Frequency.        Deviation.  Sq. of      Ratio of No. in
of Marks.   Frequency.  Ordinates.   Areas.                  Deviation.  Col. (6) to No.
                                                                         in Col. (4).
 3            5           3.9          5.7       +0.7         0.49       0.09
 8            9          10.4         10.7       +1.7         2.89       0.27
13           28          23.2         23.5       -4.5        20.25       0.86
18           49          42.9         43.1       -5.9        34.81       0.81
23           58          65.8         65.6       +7.6        57.76       0.88
28           82          83.7         83.1       +1.1         1.21       0.01
33           87          88.3         87.6       +0.6         0.36       0.00
38           79          77.3         76.8       -2.2         4.84       0.06
43           50          56.1         56.0       +6.0        36.00       0.64
48           37          33.7         34.0       -3.0         9.00       0.26
53           21          16.8         17.1       -3.9        15.21       0.89
58            6           7.0          7.2       +1.2         1.44       0.20
63            3           2.4          3.5       +0.5         0.25       0.07
Totals      514         511.5        513.9                  184.51       χ² = 5.04

Now with this object we might square each difference as in col. (6), sum the squares, and find the mean square deviation by dividing by the total frequency; this, after extracting the square root, would give what might be called the root-mean-square error, regarding the theoretical values as the true ones. In the above example it $=\sqrt{(184.51/514)}=0.599$.

But this form of result, while it may be useful in some cases, e.g. in comparing two distributions of the same kind to some theoretical series, is open to objection; for one thing it treats all the differences as if they were of equal importance in absolute magnitude, but a difference of 2, say, in a normal frequency of 10 is clearly more serious than a like difference in a frequency of 60.

The objection, however, goes deeper than that; even when the root-mean-square deviation is found we are at a loss to estimate its precise relationship to the quality of fit, as there seems to be no definite connection between one distribution and another of a different kind: there is no standard case, so to speak, to which we can always appeal, where the fit is agreed to be good and supplying therefore a suitable root-mean-square deviation for comparison.

This leads us to the question: What constitutes goodness of fit?
Suppose by some means we have selected a theoretical or empirical formula to describe a certain frequency distribution in a given population; if the frequency values observed do not differ from the theoretical frequencies by more than the deviations we might expect owing to random sampling, then clearly the fit may be regarded as a good one. And we have a measure of the fit if we can find the proportion of random samples, of the same size as the given distribution, showing greater deviations from the distribution given by theory than those which are actually observed.

Now Professor Karl Pearson has shown how this proportion can be calculated [Phil. Mag., vol. l., pp. 157-175 (1900)]; he finds the probability that a random sample should give a frequency distribution differing from that which theory proposes by as much as or by more than the distribution actually observed. This probability, $P$, is a function of $\chi^{2}$, where

$$\chi^{2}=\Sigma\left\{\frac{(y-y')^{2}}{y}\right\},$$

$y$ and $y'$ representing the theoretical and observed frequencies for any particular group, and the summation is to include all groups. It will be noted that this expression gives each difference $(y-y')$ its appropriate importance by relating it to the frequency $y$ of its own group.

A table in Biometrika (vol. i., pp. 155 et seq.) gives the values of $P$ corresponding to different values of $\chi^{2}$ (including all integral values from 1 to 30) and to values of $n'$, the total number of frequency groups, from 3 to 30 (see also p. 285). The mathematics involved in finding $P$ is difficult, and the reader who wishes to enter into it must consult the original memoir, but the utility of the function has been proved by experience and it is readily applied in a particular case.

In the above example $\chi^{2}$ is found from col. (7): it equals 5.04, and from the table of values of $P$, when $n'=13$, we have $P=0.957979$ when $\chi^{2}=5$, and $P=0.916082$ when $\chi^{2}=6$. Therefore, by proportional interpolation, when $\chi^{2}=5.04$, $P=0.956303$.
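The two tabular entries quoted above can be reproduced without the table when the number of groups $n'$ is odd, for then $P$ has a finite closed form, $P=e^{-\chi^{2}/2}\sum_{k=0}^{(n'-3)/2}(\chi^{2}/2)^{k}/k!$. This is the standard expression for the upper tail of the $\chi^{2}$ distribution with an even number, $n'-1$, of degrees of freedom; it is stated here as a known result and is not derived in the text. The function name `pearson_p` is an assumption of this sketch.

```python
import math


def pearson_p(chi2, n_groups):
    """Probability P of a worse fit, for an odd number of groups n'.

    Uses the finite series valid when n' - 1 is even; other cases are
    not covered by this sketch.
    """
    dof = n_groups - 1
    if dof < 2 or dof % 2 != 0:
        raise ValueError("this sketch handles only odd n' (even n' - 1)")
    half = chi2 / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, dof // 2):
        term *= half / k   # builds (chi2/2)**k / k!
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    print(pearson_p(5.0, 13), pearson_p(6.0, 13), pearson_p(5.04, 13))
```

Evaluating the function directly at $\chi^{2}=5.04$ gives the value that the proportional interpolation in the text approximates.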
Thus, supposing our data to follow the normal curve, in 956 random samples out of 1000 we should expect to get a worse-fitting distribution than that given by the sample actually observed. We may therefore conclude without hesitation that the normal curve provides an excellent fit in this particular instance.

We pass on now to fresh distributions to illustrate some of the other types of frequency curve.

Example (2) deals with the percentage of trade union members unemployed at the end of each month for the years 1898 to 1912 [data from the Sixteenth Abstract of Labour Statistics of the United Kingdom, Cd. 7131].

Table (43) shows the distribution of the 180 records according to the percentage unemployed. The deviations are measured from the centre of the group (3.9 to 5.2) as origin, and the class interval (1.3 per cent.) is taken as unit of deviation as usual. The first four moments are:

$$-29/180\ (=\bar x),\quad 425/180,\quad 397/180,\quad 3053/180;$$

i.e. $-0.1611111$, $2.3611111$, $2.2055556$, $16.9611111$.

Table (43). Distribution of Unemployed Percentages of Trade Union Members.

Percentage    Deviation.   Frequency.   First     Second    Third     Fourth
Unemployed.                             Moment.   Moment.   Moment.   Moment.
 0-           -3             0             0         0         0         0
 1.3-         -2            33           -66       132      -264       528
 2.6-         -1            57           -57        57       -57        57
 3.9-          0            41             0         0         0         0
 5.2-         +1            24           +24        24       +24        24
 6.5-         +2            10           +20        40       +80       160
 7.8-         +3            11           +33        99      +297       891
 9.1-         +4             3           +12        48      +192       768
10.4-         +5             1            +5        25      +125       625
Totals                     180           -29       425      +397      3053

Referred to the mean, $4.55+1.3\bar x=4.3405556$, the second, third, and fourth moments are (see Appendix, Note 5),

$$\nu_2=2.3611111-\bar x^{2}=2.3351543,$$

$$\nu_3=2.2055556-3\bar x\nu'_2+2\bar x^{3}=3.338395,$$

$$\nu_4=16.9611111-4\bar x\nu'_3+6\bar x^{2}\nu'_2-3\bar x^{4}=18.74817.$$

Owing to the very doubtful contact at the beginning of the curve Sheppard's adjustments were not made in this case, but the rough moments as calculated above were used.

Thus

$$\beta_1=\nu_3^{2}/\nu_2^{3}=0.875242,\qquad\beta_2=\nu_4/\nu_2^{2}=3.43817,$$

and

$$\kappa=\beta_1(\beta_2+3)^{2}/4(4\beta_2-3\beta_1)(2\beta_2-3\beta_1-6)=-0.466.$$
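The evaluation of the criterion from $\beta_1$ and $\beta_2$ is easily mechanised. The following Python sketch simply transcribes the formula for $\kappa$ used in the two examples; the function name `pearson_kappa` is an illustrative choice and nothing more.

```python
def pearson_kappa(beta1, beta2):
    """Criterion kappa as a function of beta1 and beta2."""
    numerator = beta1 * (beta2 + 3) ** 2
    denominator = 4 * (4 * beta2 - 3 * beta1) * (2 * beta2 - 3 * beta1 - 6)
    return numerator / denominator


if __name__ == "__main__":
    # Example (1): examination marks; Example (2): unemployed percentages.
    print(pearson_kappa(0.00056, 2.66365))
    print(pearson_kappa(0.875242, 3.43817))
```

Both examples give a negative $\kappa$, but in the first it is so near zero, with $\beta_2$ near 3, that the normal curve was adopted.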
Since $\kappa$ is negative the fitting curve should be of Type I., the equation of which is

$$y = y_0\left(1+\frac{x}{a_1}\right)^{m_1}\left(1-\frac{x}{a_2}\right)^{m_2},$$

where $m_1/a_1 = m_2/a_2$, and $(a_1+a_2) = b$, say. It is therefore necessary before going further to determine $y_0$, $a_1$, $a_2$, $b$, $m_1$ and $m_2$ in terms of $\nu_2$, $\nu_3$, $\nu_4$, or $\beta_1$ and $\beta_2$, the constants of the distribution.

The value of $y_0$ is found to be most conveniently expressed as a Gamma function which is defined, with the usual notation, thus:

$$\Gamma(n) = \int_0^\infty x^{n-1}e^{-x}\,dx,$$

whence it follows that $\Gamma(k+1) = k\Gamma(k)$. [See Appendix, Note 9, also p. 285.] Also, if

$$B(m, n) = \int_0^1 x^{m-1}(1-x)^{n-1}\,dx,$$

it may be easily shown that $B(m, n) = \Gamma(m)\Gamma(n)/\Gamma(m+n)$. [See Appendix, Note 9.]

The general method of procedure in determining the constants for all the different types is:

1. Express the fact that the area of the curve is a measure of the total frequency of the distribution; this enables us to find $y_0$.
2. Find the nth moment of the curve with regard to some fixed origin; giving n particular values, 1, 2, 3, 4, this leads to the determination of $\mu_2$, $\mu_3$, $\mu_4$, $\beta_1$, $\beta_2$ in terms of the constants of the curve, and thence to formulae for calculating the constants.

Once found, the same formulae may be used, of course, in all cases of the same type: we have only to replace letters by the numbers for which they stand.

Applying this method to the Type I. curve, we have

$$N = \int_{-a_1}^{a_2} y\,dx = \frac{y_0}{a_1^{m_1}a_2^{m_2}}\int_{-a_1}^{a_2}(a_1+x)^{m_1}(a_2-x)^{m_2}\,dx.$$

Put $(a_1+x) = (a_1+a_2)z$, so that $(a_2-x) = (a_1+a_2)(1-z)$ and $dx/dz = (a_1+a_2) = b$; therefore

$$N = \frac{y_0\,b^{m_1+m_2+1}}{a_1^{m_1}a_2^{m_2}}\int_0^1 z^{m_1}(1-z)^{m_2}\,dz \qquad (2)$$

$$\phantom{N} = \frac{y_0\,b^{m_1+m_2+1}}{a_1^{m_1}a_2^{m_2}}\,B(m_1+1,\ m_2+1).$$

Hence

$$y_0 = \frac{N}{b}\cdot\frac{m_1^{m_1}m_2^{m_2}}{(m_1+m_2)^{m_1+m_2}}\cdot\frac{\Gamma(m_1+m_2+2)}{\Gamma(m_1+1)\Gamma(m_2+1)}.$$

Again,

$$N\mu'_n = \int_{-a_1}^{a_2} y\,(a_1+x)^n\,dx,$$

where $\mu'_n$ is the nth moment of the distribution referred to $(-a_1, 0)$, the point where the curve starts from the axis on the left-hand side, as origin. Therefore, as above,

$$N\mu'_n = b^nN\int_0^1 z^{m_1+n}(1-z)^{m_2}\,dz \Big/ \int_0^1 z^{m_1}(1-z)^{m_2}\,dz, \quad\text{by (2)}.$$

Hence,

$$\mu'_n = b^n\,\Gamma(m_1+n+1)\Gamma(m_1+m_2+2)/\Gamma(m_1+1)\Gamma(m_1+m_2+n+2)$$
$$\phantom{\mu'_n} = b^n(m_1+n)(m_1+n-1)\dots(m_1+1)/(m_1+m_2+n+1)(m_1+m_2+n)\dots(m_1+m_2+2),$$

by repeated application of the relation $\Gamma(k+1) = k\Gamma(k)$.

Putting n = 1, 2, 3, 4 in succession, we have

$$\mu'_1 = b(m_1+1)/(m_1+m_2+2),$$
$$\mu'_2 = b^2(m_1+2)(m_1+1)/(m_1+m_2+3)(m_1+m_2+2),$$
$$\mu'_3 = b^3(m_1+3)(m_1+2)(m_1+1)/(m_1+m_2+4)(m_1+m_2+3)(m_1+m_2+2),$$
$$\mu'_4 = b^4(m_1+4)(m_1+3)(m_1+2)(m_1+1)/(m_1+m_2+5)(m_1+m_2+4)(m_1+m_2+3)(m_1+m_2+2).$$

These relations are rendered more concise if we write $m_1+1 = m'_1$, $m_2+1 = m'_2$, $m_1+m_2+2 = r$; thus

$$\mu'_1 = bm'_1/r,$$
$$\mu'_2 = b^2m'_1(m'_1+1)/r(r+1),$$
$$\mu'_3 = b^3m'_1(m'_1+1)(m'_1+2)/r(r+1)(r+2),$$
$$\mu'_4 = b^4m'_1(m'_1+1)(m'_1+2)(m'_1+3)/r(r+1)(r+2)(r+3).$$

To get the corresponding moments referred to the mean as origin we have the relations:

$$\mu_2 = \mu'_2 - \mu_1'^2, \qquad \mu_3 = \mu'_3 - 3\mu'_1\mu'_2 + 2\mu_1'^3, \qquad \mu_4 = \mu'_4 - 4\mu'_1\mu'_3 + 6\mu_1'^2\mu'_2 - 3\mu_1'^4,$$

which, after some straightforward reduction, give

$$\mu_2 = b^2m'_1m'_2/r^2(r+1),$$
$$\mu_3 = 2b^3m'_1m'_2(m'_2-m'_1)/r^3(r+1)(r+2),$$
$$\mu_4 = 3b^4m'_1m'_2[m'_1m'_2(r-6)+2r^2]/r^4(r+1)(r+2)(r+3).$$

Thus

$$\beta_1 = \mu_3^2/\mu_2^3 = \frac{4b^6m_1'^2m_2'^2(m'_2-m'_1)^2}{r^6(r+1)^2(r+2)^2}\Big/\frac{b^6m_1'^3m_2'^3}{r^6(r+1)^3}$$
$$\phantom{\beta_1} = 4(m'_2-m'_1)^2(r+1)/m'_1m'_2(r+2)^2 = 4(r^2-4m'_1m'_2)(r+1)/m'_1m'_2(r+2)^2.$$

Therefore,

$$\frac{r^2}{m'_1m'_2} = \frac{\beta_1(r+2)^2}{4(r+1)} + 4 \qquad (3)$$

Again,

$$\beta_2 = \mu_4/\mu_2^2 = \frac{3b^4m'_1m'_2[m'_1m'_2(r-6)+2r^2]}{r^4(r+1)(r+2)(r+3)}\Big/\frac{b^4m_1'^2m_2'^2}{r^4(r+1)^2} = \frac{3[m'_1m'_2(r-6)+2r^2](r+1)}{m'_1m'_2(r+2)(r+3)}.$$

Therefore,

$$\frac{2r^2}{m'_1m'_2} = -r + 6 + \frac{\beta_2(r+2)(r+3)}{3(r+1)} \qquad (4)$$

Combining (3) and (4),

$$\frac{2\beta_1(r+2)^2}{4(r+1)} + 8 = 6 - r + \frac{\beta_2(r+2)(r+3)}{3(r+1)},$$

whence

$$r = 6(\beta_2-\beta_1-1)/(3\beta_1-2\beta_2+6) \qquad (5)$$

Again, since $\mu_2 = b^2m'_1m'_2/r^2(r+1)$, therefore

$$b^2 = \mu_2(r+1)\cdot[\beta_1(r+2)^2+16(r+1)]/4(r+1), \quad\text{by (3)},$$

i.e.

$$b = \tfrac{1}{2}\sqrt{\mu_2}\,\sqrt{\beta_1(r+2)^2+16(r+1)} \qquad (6)$$

And $m'_1m'_2 = 4r^2(r+1)/[\beta_1(r+2)^2+16(r+1)]$, while $m'_1+m'_2 = r$; hence $m'_1$ and $m'_2$ are roots of

$$m^2 - rm + \frac{4r^2(r+1)}{\beta_1(r+2)^2+16(r+1)} = 0,$$

the solution of which quadratic is

$$\frac{r}{2} \pm \frac{1}{2}\sqrt{r^2 - \frac{16r^2(r+1)}{\beta_1(r+2)^2+16(r+1)}};$$

therefore, $m_1$ and $m_2$* are respectively equal to

$$\frac{1}{2}\left\{r - 2 \pm r(r+2)\sqrt{\frac{\beta_1}{\beta_1(r+2)^2+16(r+1)}}\right\} \qquad (7)$$

and $a_1$ and $a_2$ follow from

$$\frac{a_1}{m_1} = \frac{a_2}{m_2} = \frac{b}{m_1+m_2} \qquad (8)$$

Applying these formulae to the 'unemployed' example, we find

    r   = 5.36048.     b   = 9.33236.
    m_1 = 0.169185.    m_2 = 3.191295.
    a_1 = 0.469842.    a_2 = 8.86252.
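Formulae (5) to (8), together with the expression for $y_0$ obtained from the area, lend themselves to a short numerical check. The Python sketch below takes $\nu_2$, $\beta_1$ and $\beta_2$ for the unemployment example as given in the text and recovers the constants; it assumes, as in that example, that the third moment is positive, so that $m_2$ is paired with the positive root. The variable names are illustrative only.

```python
import math

# Constants of the distribution for the 'unemployed' example (from the text).
N = 180.0
nu2 = 2.3351543
beta1 = 0.875242
beta2 = 3.43817

# Formula (5).
r = 6 * (beta2 - beta1 - 1) / (3 * beta1 - 2 * beta2 + 6)

# Common bracket beta1 (r+2)^2 + 16 (r+1) appearing in (6) and (7).
bracket = beta1 * (r + 2) ** 2 + 16 * (r + 1)

# Formula (6).
b = 0.5 * math.sqrt(nu2 * bracket)

# Formula (7); the third moment is positive here, so m2 takes the plus sign.
root = r * (r + 2) * math.sqrt(beta1 / bracket)
m1 = 0.5 * (r - 2 - root)
m2 = 0.5 * (r - 2 + root)

# Formula (8).
a1 = b * m1 / (m1 + m2)
a2 = b * m2 / (m1 + m2)

# y0 from the condition that the area of the curve equals N.
y0 = (N / b
      * m1 ** m1 * m2 ** m2 / (m1 + m2) ** (m1 + m2)
      * math.gamma(m1 + m2 + 2) / (math.gamma(m1 + 1) * math.gamma(m2 + 1)))

print(round(r, 5), round(b, 5))
print(round(m1, 6), round(m2, 6))
print(round(a1, 6), round(a2, 5), round(y0, 3))
```

Because the inputs are the rounded figures printed in the text, the last decimal place of each result may differ slightly from the values quoted.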
Also $y_0 = 58.1282$, and the equation of the curve is therefore

$$y = 58.1\left(1+\frac{x}{0.470}\right)^{0.169}\left(1-\frac{x}{8.86}\right)^{3.19}.$$

The position of the origin, which is at the mode, is given by

$$(\text{mean} - \text{mode}) = \mu'_1 - a_1 = \frac{bm'_1}{r} - \frac{bm_1}{r-2} = b\left(\frac{m'_1}{r} - \frac{m'_1-1}{r-2}\right) = b\,\frac{m'_2-m'_1}{r(r-2)} \qquad (9)$$

thus,

$$\text{mode} = 4.3405556 - 1.3\cdot\frac{1}{2}\,\frac{\nu_3}{\nu_2}\,\frac{r+2}{r-2}$$

in this particular case, $= 2.3052009$.

[* When $\mu_3$ is positive $m_2$ goes with the positive root of the quadratic, and vice versa.]

This enables us to write down any x, and thence y by substituting for x in the equation of the curve, which, by taking logs, may be written

$$\log y = \log y_0 + m_1\log\left(1+\frac{x}{a_1}\right) + m_2\log\left(1-\frac{x}{a_2}\right);$$

e.g. for the x of the group (2.6 to 3.9), bearing in mind that 1.3 is the unit of measurement for x, we have

$$x_{3.25} = (3.25-2.3052009)/1.3 = 0.9447991/1.3.$$

Hence

$$\left(1+\frac{x_{3.25}}{a_1}\right) = 2.546835; \qquad \left(1-\frac{x_{3.25}}{a_2}\right) = 0.9179953;$$
$$m_1\log\left(1+\frac{x_{3.25}}{a_1}\right) = 0.0686892; \qquad m_2\log\left(1-\frac{x_{3.25}}{a_2}\right) = -0.118587;$$

so that $\log y = 1.714489$, and $y_{3.25} = 51.82$.

Similarly the ordinates at the centre points of the other groups may be calculated, but it must be remembered that the resulting values are only a first approximation to the observed frequencies, and a better series is obtained if, by using some good quadrature formula, we calculate the areas for the successive groups between the curve, the bounding ordinates, and the axis of x. Indeed in the case of the group (1.3 to 2.6) it is essential to do this, because (1) the rise of the curve is so very abrupt as to render the determination of the single ordinate at the centre quite inadequate for an accurate measure of the frequency in that group, and (2) a portion of the group falls outside the range of the curve, which only starts at 1.6944063 (i.e. mode $- 1.3a_1$), and this has to be allowed for in finding the frequency as represented by the area between the curve and axis.

The base of the required area, range (1.6944063 to 2.6), was therefore divided into eight equal parts and the ordinates at the points of division were determined.
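The position of the mode, the ordinate at 3.25, and the area for the curtailed first group can all be reproduced numerically. The sketch below uses the constants quoted in the text; the helper names `ordinate` and `simpson` are invented here, and the quadrature is the ordinary composite Simpson rule with eight equal parts, which is assumed to be what the division of the base into eight parts is intended for.

```python
import math

# Constants of the fitted Type I. curve, as quoted in the text.
mean, unit = 4.3405556, 1.3
nu2, nu3 = 2.3351543, 3.338395
r, m1, m2 = 5.36048, 0.169185, 3.191295
a1, a2, y0 = 0.469842, 8.86252, 58.1282

# Formula (9), with 1.3 per cent. as the unit of x.
mode = mean - unit * 0.5 * (nu3 / nu2) * (r + 2) / (r - 2)
start = mode - unit * a1  # point at which the curve leaves the axis


def ordinate(percentage):
    """Ordinate of the curve at a given percentage unemployed."""
    x = (percentage - mode) / unit
    left = max(0.0, 1.0 + x / a1)    # zero outside the range of the curve
    right = max(0.0, 1.0 - x / a2)
    return y0 * left ** m1 * right ** m2


def simpson(f, lower, upper, parts):
    """Composite Simpson rule with an even number of equal parts."""
    h = (upper - lower) / parts
    ys = [f(lower + i * h) for i in range(parts + 1)]
    return (h / 3.0) * (ys[0] + ys[-1]
                        + 4.0 * sum(ys[1:-1:2])
                        + 2.0 * sum(ys[2:-1:2]))


y_325 = ordinate(3.25)

# Area for the group beginning at 1.3; dividing by the class interval
# expresses the result as a frequency.
area_first_group = simpson(ordinate, start, 2.6, 8) / unit

print(round(mode, 6), round(start, 6))
print(round(y_325, 2), round(area_first_group, 2))
```

The central ordinate of this first group, by contrast, is about 55, which illustrates why the single ordinate is rejected as a measure of the frequency there.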
The area was then found by using Simpson's well-known formula:

$$\text{Area} = \tfrac{1}{3}h\,[(y_0+y_{2p}) + 2(y_2+y_4+\dots+y_{2p-2}) + 4(y_1+y_3+\dots+y_{2p-1})],$$

where h denotes the length of one of the equal parts into which the base is divided and 2p is their number; in our case p = 4 and h = 1/8, the class interval being the unit, and the result is to be reduced in the ratio 0.9055937 : 1.3 in order to allow for the smaller range of this group; we thus get as the area for the group

$$\frac{0.9055937}{1.3}\times\frac{1}{24}\,[(y_0+y_8) + 2(y_2+y_4+y_6) + 4(y_1+y_3+y_5+y_7)] = 37.39.$$

The observed and calculated frequencies for the whole series are compared in Table (44), the remaining areas in col. (4) being calculated by the simpler but somewhat less accurate form of Simpson's formula, when only three ordinates are used, namely,

$$\int_{-1}^{+1} y\,dx = \tfrac{1}{3}(y_{-1}+4y_0+y_1).$$

Table (44). Comparison of Observed and Theoretical Frequencies of Unemployed Percentages

    (1)           (2)          (3)           (4)          (5)          (6)          (7)
    Percentage    Observed     Theoretical Frequency.     Deviation.   Square of    Ratio of No. in
    Unemployed.   Frequency.   Ordinates.    Areas.                    Deviation.   Col. (6) to No.
                                                                                    in Col. (4).
     1.3-            33          55.3*        37.4          +4.4         19.36         0.52
     2.6-            57          51.8         51.6          -5.4         29.16         0.57
     3.9-            41          37.8         37.8          -3.2         10.24         0.27
     5.2-            24          24.9         25.0          +1.0          1.00         0.04
     6.5-            10          14.8         14.9          +4.9         24.01         1.61
     7.8-            11           7.7          7.8          -3.2         10.24         1.31
     9.1-             3           3.3          3.4          +0.4          0.16         0.05
    10.4-             1           1.0          1.2          +0.2          0.04         0.03
    Total           180           ..         179.1           ..           ..      chi^2 = 4.40

To test the goodness of fit we have n' = 8, $\chi^2 = 4.40$, whence, by means of the P table, P = 0.731852. Thus, roughly, we may say that three out of every four random samples of 180 records would give a worse fit with the proposed curve than is given by the actual distribution observed, so that the fit may be regarded as quite a reasonably good one. This conclusion is also supported by an examination of the curve which has been drawn, fig. (36), with the histogram of the given statistics.

Example (3).
The data for this example concerning infectious diseases will be found in Table (16), p. 62 (or, see p. 224); the reader should work out the moments for himself and verify the following results:

[* The ordinate in this case cannot be accepted as an approximation to the frequency given by the curve.]

The first four moments referred to 7 as origin are

    0.282158, 4.86307, 17.4855, 129.394.

Referred to the mean, 7.564316, the three latter become

$$\nu_2 = 4.78346, \qquad \nu_3 = 13.4140, \qquad \nu_4 = 111.964.$$

If we do not assume high contact at the terminals, and certainly at the lower end it is doubtful, we deduce from the above values of the moments that

$$\beta_1 = 1.64396, \qquad \beta_2 = 4.89321, \qquad \kappa = -1.53.$$

Thus the fitting curve is of Type I. and its constants, when calculated, are

    r   = 11.7819.
    m_1 = 0.31171.     m_2 = 9.47020.
    a_1 = 0.79216.     a_2 = 24.0671.
    y_0 = 60.363.

[Fig. (36). Fitted curve and histogram for Example (2); horizontal axis: Percentage Unemployed.]

The equation of the curve is therefore, retaining three significant figures throughout,

$$y = 60.4\left(1+\frac{x}{0.792}\right)^{0.312}\left(1-\frac{x}{24.1}\right)^{9.47}.$$

The curve starts at 2.02904 (so that the first group of observations lies wholly outside its range) and ends at 51.7475. It is drawn, together with the corresponding histogram, in fig. (37).

Suppose, just for the sake of comparison, we assume high contact at the terminals and attempt to fit the given distribution with a Type III. curve, to which Type I. is closely related.
We then have, after making Sheppard's adjustments,

$$\mu_2 = 4.70013, \qquad \mu_3 = 13.4140, \qquad \mu_4 = 109.601,$$

whence $\beta_1 = 1.73295$, $\beta_2 = 4.96129$, $\kappa = -1.47$.

It will be noted that the theoretically correct type to take here again is Type I., but this was discarded because, when attempted, it led to a curve starting at a point corresponding to a disease rate of 3.385, so that the central ordinates of each of the first two observed groups lay outside the curve altogether.

Type III. curve is of the form

$$y = y_0\,e^{-\gamma x}\left(1+\frac{x}{a}\right)^{\gamma a}.$$

[Fig. (37). Type I. and Type III. curves drawn with the histogram; horizontal axis: Disease Rate per 1000 persons living.]

To express the constants in terms of the moments, noting that the curve starts from $x = -a$ on one side and goes off to infinity on the other, we have

$$N = \int_{-a}^{\infty} y\,dx = \frac{y_0}{a^p}\int_{-a}^{\infty} e^{-\gamma x}(a+x)^p\,dx \quad (\text{where } \gamma a = p)$$
$$\phantom{N} = \frac{y_0}{\gamma^pa^p}\int_{-a}^{\infty} e^{-\gamma x}(\gamma a+\gamma x)^p\,dx$$
$$\phantom{N} = \frac{y_0\,e^{\gamma a}}{p^p}\int_{-a}^{\infty} e^{-(\gamma a+\gamma x)}(\gamma a+\gamma x)^p\,dx$$
$$\phantom{N} = \frac{y_0\,e^{p}}{\gamma p^p}\int_0^{\infty} e^{-z}z^p\,dz \quad (\text{where } \gamma a+\gamma x = z)$$
$$\phantom{N} = \frac{y_0\,e^{p}}{\gamma p^p}\,\Gamma(p+1).$$

Therefore,

$$y_0 = Np^{p+1}/a\,e^p\,\Gamma(p+1) \qquad (10)$$

Again, the nth moment of the distribution referred to $(-a, 0)$ as origin is

$$\mu'_n = \frac{1}{N}\int_{-a}^{\infty} y\,(a+x)^n\,dx = \frac{y_0\,e^p}{N\gamma^{n+1}p^p}\,\Gamma(p+n+1).$$

Therefore, by (10),

$$\mu'_n = \Gamma(p+n+1)/\gamma^n\Gamma(p+1).$$

Hence,

$$\mu'_1 = \Gamma(p+2)/\gamma\Gamma(p+1) = (p+1)/\gamma,$$
$$\mu'_2 = \Gamma(p+3)/\gamma^2\Gamma(p+1) = (p+2)(p+1)/\gamma^2,$$
$$\mu'_3 = \Gamma(p+4)/\gamma^3\Gamma(p+1) = (p+3)(p+2)(p+1)/\gamma^3.$$

Transferring to the mean as origin we have for the moments,

$$\mu_2 = \mu'_2 - \mu_1'^2 = (p+1)/\gamma^2,$$
$$\mu_3 = \mu'_3 - 3\mu'_1\mu_2 - \mu_1'^3 = 2(p+1)/\gamma^3.$$

Hence, combining these last two equations,

$$\gamma = 2\mu_2/\mu_3, \qquad p = (4\mu_2^3/\mu_3^2) - 1 \qquad (11)$$

In our particular case these equations give

$$\gamma = 0.700780, \qquad p = 1.30820, \qquad a = 1.86678,$$

and, therefore, by (10), $y_0 = 55.3323$. Hence the curve is

$$y = 55.3\,e^{-0.701x}\left(1+\frac{x}{1.87}\right)^{1.31}.$$

The equation of the curve, on taking logs, gives

$$\log y = \log y_0 - \gamma\log_{10}e\cdot x + p\log\left(1+\frac{x}{a}\right)$$
$$\phantom{\log y} = 1.742979 - 0.304345x + 1.30820\log(1+x/1.86678).$$

Before we can go on to calculate the ordinates of the curve we need to know where the origin lies, and since it coincides with the mode it may be found from

$$\text{mean} - \text{mode} = \mu'_1 - a = (p+1)/\gamma - p/\gamma = 1/\gamma \qquad (12)$$

Thus, mode $= 7.564316 - 2.853960 = 4.71036$.

Suppose now we wish to calculate the ordinate corresponding to the x of the centre point of group (6 to 8), we have

$$x_7 = \tfrac{1}{2}(7-4.71036) = 1.14482,$$

bearing in mind that the unit is a rate of 2 per 1000. Hence, substituting this value in the equation for log y,

$$\log y_7 = 1.666278, \qquad y_7 = 46.374,$$

and similarly any other y may be found. The curve starts at

$$\text{mode} - a = 4.71036 - 2(1.86678) = 0.97680,$$

so that the range of the first group as determined from the curve is (0.9768 to 2), and not (0 to 2) as in the observations.

The ordinates and afterwards the areas, calculated by a method somewhat similar to that indicated in Example (2), were determined for each separate group of observations, and the results for both Type I. and Type III. curves are compared in Table (45). Type III. curve is drawn on the same diagram, fig. (37), as Type I. curve and the observation histogram, and the result lends emphasis to an important point, namely, the necessity for replacing ordinates by areas to obtain the frequency proper to any group.

In order to get a measure of the goodness of fit in each case, the function P was calculated, but in the Type I.
comparison the first group had to be omitted to avoid the infinite term which would have resulted in $\chi^2$, owing to this group falling right outside the curve, that is to say, the test had to be confined to towns in which the observed case rate was not less than 2. The values found for P were:

    Type I.:   P = 0.34307,
    Type III.: P = 0.46298,

so that in every 100 samples containing 241 observations each, we should get, roughly, 34 deviating from the Type I. curve and 46 deviating from the Type III. curve, at least as widely as the given distribution. In neither case can the fit be regarded as a very good one, but the failure is only marked in one or two groups, such as that of maximum frequency, where there may be other than random causes to account for it; e.g. where isolation is inefficient the disease is likely to spread, one case infects another: in other words, the events are not independent.

Table (45). Comparison of Observed Distribution of Infectious Disease Rates, notified in 241 large Towns of England and Wales, with Theoretical Distribution.

    (1)          (2)           (3)          (4)           (5)              (6)
    Case Rate.   Observed      Theoretical Frequency.     (f_1-f)^2/f_1.   (f_3-f)^2/f_3.
                 Frequency.    Type I.      Type III.
                 (f)           (f_1)        (f_3)
     0-             5            ..           6.6           ..              0.39
     2-            39           52.6         43.7          3.52             0.51
     4-            69           55.4         54.3          3.34             3.98
     6-            41           43.2         46.2          0.11             0.59
     8-            29           31.2         33.6          0.15             0.63
    10-            22           21.5         22.4          0.01             0.01
    12-            16           14.2         14.1          0.23             0.26
    14-             7            9.1          8.6          0.48             0.30
    16-             5            5.6          5.1          0.06             0.00
    18-             3            3.3          2.9          0.03             0.00
    20-             4            1.9          1.7          2.32             3.11
    22-             0            1.0          0.9          1.00             0.90
    24-             0            0.5          0.5          0.50             0.50
    26-             1            0.3          0.3          1.63             1.63
    Total         241          239.8        240.9     chi_1^2 = 13.38  chi_3^2 = 12.81

Example (4) refers to the wages of certain women tailors previously recorded in Table (11), p. 41. The data as given in the original suffered a disadvantage common to such statistics: at either end the grouping differed from that in the centre, two or three classes being lumped together owing to the smallness of frequency in each.
The figures ran thus : — Under 5s., 19 ; 5s. and under 6s., 180 ; 6s. and under 7s., 384 ; ... ; 23s. and under 24s., 64 ; 24s. and under 25s., 54 ; 25s. and under 30s., 122 ; 30s. and over, 36. They were recast in the form sho\\Ti in Table (46), suggested by an examination of the histogram, in order to make the fitting simpler. The first four moments calculated from this adapted table and referred to 12s. as origin are : — /i=0-556718, i;'2=5056373, ^^'3=16•70163, i.'4=123-7691. When referred to the mean, 13-113436, the last three become 1/2=4-746438, i/3=8-60179, i/4=95-6914 ; or, after making Sheppard's adjustments, /^2=4-663105, /Lt3=8-60179, /m4=93-3474; therefore, ^1=0-729713, /S2=4-29291, «=l-63. The curve is thus of Type VI., y=yo(x-a)iVx'". To calculate the constants, the nth moment about the origin is given by =2/o j{x—af'\v''-'''dx •0 ,^1-2)"^ a-31 / 1\^ / ^ a - — • — -(i{ — ^ )dzl where x=- d'li-'.-z-n-l Thus, putting n=0, N=-^B(<7i-^.,-L (?3+l) . . . (13) and ^'^=a-riq,-q,-l-n)r{q,)!r{q,-n)r{qi-q,-l); therefore, fi\=ar{qi-q^-2)r{q,)/r{qi-l)r{q^-q.,-l} =a(q,-l)/{q,-q,-2). Also ^' Jix\_^=ar{qi-qi-l-n)r{qi-n+l)iriqi-n)Tiqr-qi-n) =a{q^-n)/{qi-q2-n-l). 226 STATISTICS ^Blice fM',^aHq,-\){q,-2)l{q,-q,-^2){q,~q,-3) IJi',=a^{q,-mh^2)iq,-3)l{q,-q,-^2){q,-q,~3)iq,-q,^A) (?i-?2-4)(?i-c-=iL___ ^ r ^ 1 " ::= Oi s d w kxi ; „ ^ Number of Sepals Fid. (I^O). 10 12 where log i/o=9-38179. The origin is at 4-27 and the mode at 6-04. The greatest frequency is 620 approximately, and the frequency dis- tribution, calculating areas for the several groups as if they ranged between (4-5— 5-5), (o-5 — 6-5), etc., is shoMTi alongside the observed 230 STATISTICS distribution in Table (47). The curve is plotted in fig. (39) from the ordinates which were calculated at the centre and extremities of each group so as to enable Simpson's simple quadrature formula to be used to get the areas. Table (47). Distribution of Sepals of Anemone Nemorosa, observed and calculated. 
[Examples have been given above of five out of the seven different types of frequency curve that have been enumerated. For further examples of all the types and a complete account of the method reference should be made to Professor Pearson's memoirs, especially the following:

Roy. Soc. Phil. Trans., vol. 186A, pp. 343-414 (1895), On Skew Variation in Homogeneous Material; and a Supplementary Memoir in vol. 197A, pp. 443-459 (1901).

Biometrika, vol. i., pp. 265 et seq., On the Systematic Fitting of Curves to Observations and Measurements, continued in vol. ii., pp. 1-23. Also vol. iv., pp. 169-212, which discusses various historical hypotheses made to generalize the Gaussian Law, the basis of the symmetrical normal curve.

A large number of highly interesting practical illustrations of Pearsonian curve fitting occur throughout the pages of Biometrika, while W. P. Elderton's Frequency Curves and Correlation contains an admirably concise treatment of the theory, with applications to meet more particularly the actuarial point of view.

It should be stated that rival curves and methods have been proposed as suitable for fitting certain types of frequency distribution, some of which have scarcely received the attention and the trial they deserve. Among the most interesting are those developed by Professor Edgeworth; for some account of his voluminous work upon the subject the reader may refer to several memoirs in the Journal of the Royal Statistical Society, beginning December 1898 (the Method of Translation), among which the following are important as giving more recent results of his researches:

Vol. lxix. (1906), The Generalized Law of Error or Law of Great Numbers.
Vol. lxxvii. (1914), On the Use of Analytical Geometry to Represent Certain Kinds of Statistics.
Vol. lxxix. (1916), On the Mathematical Representations of Statistical Data; continued in vol. lxxx. (1917).
Two memoirs may be cited as of particular interest, those of May 1917 and March 1918, because they reply to criticism and draw a comparison from their author's point of view between his curves and those of Professor Pearson.]

CHAPTER XVIII

THE NORMAL CURVE OF ERROR

Let us return for a moment to the general statement on p. 143, that 'whenever we have n similar but independent events happening in which the probability of success for each is p, the different resulting possibilities as to success are given by the successive terms in $(s+f)^n$, namely,

$$s^n,\quad ns^{n-1}f,\quad \frac{n(n-1)}{1\cdot 2}s^{n-2}f^2,\ \dots$$

and their correspondent probabilities by the successive terms in $(p+q)^n$, namely,

$$p^n,\quad np^{n-1}q,\quad \frac{n(n-1)}{1\cdot 2}p^{n-2}q^2,\ \dots\text{'}$$

When we come to try and apply this theory directly to cases other than those of random sampling in artificial experiments with coins, dice, etc., we are faced at once with difficulties because of the limiting character of the assumption on which the theory rests, namely, that all the events are to be similar and independent. The similarity demanded is of the same radical type as that existing when we throw the same die or spin the same coin twice running, and the test for it is that p, the chance of success, is to be the same for every individual event. The independence is to be such that no single event and no combination of events is to have any influence upon any of the rest.

Now for most classes of events it is impossible to assign any a priori value to p at all, still less can we be sure that p does not change from one event to the next. For example, the chance of death for soldiers in war-time varies from regiment to regiment according to where they happen to be located; for the same regiment it varies from battalion to battalion according to whether they are in the trenches or behind the lines; and from individual to individual according to innumerable little accidents of time, place, and condition.
Also, where the shells burst thickest, p increases for any soldier there, but it increases also for his neighbour. Thus the events in such a case are not similar, neither are they independent.

Moreover, as it stands, the theory cannot be applied to any distribution in which the character observed is capable of continuous variation. This difficulty, however, has been overcome, as we have seen, by replacing the histogram representative of the binomial by a continuous curve which at the same time serves to describe the discontinuous series to a high degree of accuracy.

To illustrate how close this description can be, even when n is comparatively small, we will fit with its appropriate normal curve the symmetrical binomial polygon formed by joining up the summits of the ordinates representing successive terms of the series $2^{10}(\tfrac{1}{2}+\tfrac{1}{2})^{10}$ erected at unit distance apart.

The total area bounded by the polygon, the extreme ordinates, and the axis of x is practically

$$= (\text{sum of the given ordinates}) \times (1) = 2^{10}(\tfrac{1}{2}+\tfrac{1}{2})^{10} = 1024.$$

The equation of the normal curve is

$$y = Y_0\,e^{-x^2/2\sigma^2},$$

where $\sigma^2 = npq = 11\times\tfrac{1}{2}\times\tfrac{1}{2} = 2.75$, and $Y_0 = N/\sqrt{2\pi}\cdot\sigma = 1024/\sqrt{5.5\pi}$. Hence, taking logs, we have

$$\log y = \log Y_0 - \frac{x^2}{2\sigma^2}\log_{10}e = 2.3915437 - x^2(0.0789626).$$

It is easy from this equation to calculate the normal curve ordinates corresponding to x = 0, 1, 2, 3, 4, 5, and the results, compared with the polygon ordinates, are as follows:

    x      Ordinate of Polygon.   Normal Curve Ordinate.
     0            252                   246.3
    ±1            210                   205.4
    ±2            120                   119.0
    ±3             45                    48.0
    ±4             10                    13.4
    ±5              1                     2.6

Now although the circumstances in which the series

$$p^n + np^{n-1}q + \frac{n(n-1)}{1\cdot 2}p^{n-2}q^2 + \dots$$

may be taken to represent the frequency distribution resulting from a particular kind of experiment were so stringently defined, there is no reason why the normal curve itself to which the theory led should be subjected to precisely the same limitations.
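The comparison of polygon and normal curve ordinates tabulated above may be reproduced with a few lines of Python. The sketch simply takes the value $\sigma^2 = 2.75$ used in the text as given; the dictionary names are illustrative.

```python
import math

total = 2 ** 10          # sum of the binomial ordinates, 1024
sigma_sq = 2.75          # value of sigma^2 adopted in the text
y_zero = total / math.sqrt(2.0 * math.pi * sigma_sq)

# Polygon ordinates: successive terms of 2^10 (1/2 + 1/2)^10,
# indexed by distance x from the central term.
polygon = {x: math.comb(10, 5 + x) for x in range(6)}

# Normal curve ordinates y = Y0 exp(-x^2 / (2 sigma^2)).
normal = {x: y_zero * math.exp(-x * x / (2.0 * sigma_sq)) for x in range(6)}

for x in range(6):
    print(x, polygon[x], round(normal[x], 1))
```

The agreement is closest near the centre and loosens in the tails, where the polygon ordinates are small.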
After all, the real and only justification for choosing one curve rather than another to fit any given observations is that it does succeed in fitting them better. But when the further question is asked why the normal curve should succeed in describing some results so well, we must not be tempted by analogy to rush to the conclusion that the causes at work are necessarily independent, and equal, and so on. In short, the theoretical justification and the empirical use of the normal curve are two quite different matters.

Experience shows that the normal curve suffices to fit certain types of distribution, besides those which arise in tossing coins and in similar experiments, with remarkable accuracy; among these may be noted:

1. Certain biological statistics; for instance, the proportions of male to female births taken over a series of years for a large community such as the population of a country; also the proportions of different types of plants and animals resulting from cross-fertilization.

2. Certain anthropometrical, particularly craniometrical and allied statistics, such as the height, weight, lengths of various bones, skull measurements, etc., of a large group of persons, and the agreement is the closer if the group be reasonably homogeneous, i.e. composed of individuals of the same nationality and sex between the same narrow age limits, etc.; also measurements of a similar character in animals and plants.

3. Errors of observation in experimental work; for example, several measurements of the same quantity (length, weight, speed, temperature, or whatever it be) will contain errors of this kind which are equally liable to be above or below the true value.

4. The marks of shots upon a given target, assuming that the shots are equally liable to err in any given direction.
This is an interesting case of the normal law in two dimensions, for the north and south line and the east and west line through the centre of the target may both be regarded as axes of normal curves of error.*

5. Certain sociological statistics of a comparatively stationary character; for example, rates of birth, marriage, or death at neighbouring times or like places; also the wages (and possibly the output if it could be satisfactorily measured) of large numbers of workers engaged in the same occupation under the same general conditions.

6. Any statistics or quantities that are individually compounded of a large number of elements, mostly independent of one another, which themselves vary between limits not very widely divergent, and none of which exert a preponderating influence upon their resultant statistic. The latter may be simply the sum of its elements, or, more generally, it may be any function of the elements which, to the first degree of approximation, can be expressed in linear form.

Now it would be a difficult matter in most of these cases to satisfy ourselves as to the fulfilment or non-fulfilment of conditions like those on which the binomial distribution rests. It is not easy indeed to visualize them perfectly, except in artificial experiments where they are largely under control. If anything, the chances seem almost hopelessly against their fulfilment in ordinary life, so closely must we hedge round our sample to keep out unequal influences.

For example, to use a frequently quoted illustration, if p measures the chance of death for an individual, the death rate varies, as we know, considerably from place to place according to the age and sex constitution of the population; it is influenced by differences in class, and occupation, and manner of life; it is altered from time to time, violently by the ravages of war or disease, more gradually by improvement in general sanitation, housing conditions, etc.
We should only expect to get the binomial distribution (and consequently the normal law if it depended upon the

[* Sir John Herschel published in the Edinburgh Review (1850) an a priori proof of the normal law from a consideration of this problem. Taking . . .

. . . (2)

[Fig. (42).]

Geometrically, the area represented by the shaded portion of fig. (42) measures the frequency of errors between $+x_1$ and $+x_2$, while the complete area between the curve and axis X'OX measures the total frequency, so that the probability of an error between $+x_1$ and $+x_2$ is measured by the proportion which the area of the shaded portion bears to the whole area.

If in the above expression (1) we put $x/\sigma = \xi$, so that $dx/d\xi = \sigma$, it becomes

$$\frac{1}{\sqrt{2\pi}}\int_{\xi_1}^{\xi_2} e^{-\frac{1}{2}\xi^2}\,d\xi \qquad (3)$$

which is known as the probability integral, $\xi_1$ and $\xi_2$ being the values of $\xi$ which correspond to the values $x_1$ and $x_2$ of x. But this integral measures the area of the shaded portion of the curve

$$y = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}\xi^2} \qquad (4)$$

shown in fig. (43), which is really the normal curve over again, but drawn on a different scale, namely, with the ordinates reduced in the ratio $N : \sigma$ and with the standard deviation $\sigma$ taken as the unit of measurement for x, for $\xi = 1, 2, 3 \dots$ when $x = \sigma, 2\sigma, 3\sigma, \dots$ This has the effect of making the total area unity and the area given by

$$\frac{1}{\sqrt{2\pi}}\int_{\xi_1}^{\xi_2} e^{-\frac{1}{2}\xi^2}\,d\xi \qquad (3)\ \text{bis}$$

now directly measures the probability of an error between $\sigma\xi_1$ and $\sigma\xi_2$.

Tables have been prepared (see pp. 284, 285) which enable us to write down the value of this integral for different values of $\xi_1$ and $\xi_2$ between certain limits (see Appendix, Note 10).

Let us take an example to show how the curve may be used, and we choose one leading to a binomial distribution, so giving an expression for the probability by first principles, in order to compare the two methods.

Example. Suppose we toss simultaneously 100 coins, and suppose the chance of success, say 'heads,' is the same for each coin and equal to 1/2. In that case, according to the binomial theory.
the probability of 100 heads ={l/2)^^^, „ 99 heads and 1 tail =iooCi(l/2)99(l/2), „ 98 heads and 2 tails =10002(1/2)98(1/2)2, andso on. The most probable number of heads =np = (100)(l/2) =50. This does not mean, as explained before, that if we perform the experiment once we are sure on that one occasion to get exactly 50 heads and 50 tails, but that if we go on repeating the experiment we shall in the long run get 50 heads and 50 tails turning up more often than any other combination. Let it be required to find the probability of getting at least 55 THE NORMAL CURVE OF ERROR 241 heads, that is, we want the probabihty of gettmg 55 heads or more, and this is given by a sum not very readily calculated if we have to go at it in a straight- forward manner. Now let us turn to the curve of error method. The standard deviation for the distribution is given by Since the mean number of heads to be expected if the experiment is repeated a considerable number of times =50, we want to find the probability of an error equal to or greater than 5, i.e. an error lying between a and 4-00> because ct=5. But the probability of an error between a^^ and cr^g V27T.'h Hence the required probability 1 -^ \/277.'l =0-15866, by the probability integral tables. In other m ords, if we repeated the experiment 100 times, we might expect 55 or more heads about 16 times. We can now show that if Xj, Xg are two uncorrelated variables obeying the normal law, then {w^-^-\-w.^<^) will obey the same law. Suppose Xy, Xo are observed deviations from the mean values Xj, X2 in one particular record, ctj, o-g being the respective S.D.'s. Let X=?rjXj+w'2X2' ^^^^)^/2-^2'-'.^ by (2). Now this is in a form which only involves hx, x, and x^, and we get the total probability for an error \y\ng between x and {x-^hx) by givmg all possible values to the error x■^^. But the j)robability for x^ itself to lie between x^ and {x^-\-hx^ _ 1 r^i+^-'-i g-x-/2...^- . ... . ;z 1 J B, _ ^c: 7 ^ ;"] , 1 " ' \^ .... 
jtoj-^f^ i ; ; I 4.._j — -■'■^Vl 1 f-- - - " ^-? ;t = == , if we give y some particular value t/j, 27TVl — r- . a^Uy we find that the law of frequency for the corresponding x is __JL_*^(l-r2)+('£-,M*} =U,.e 2(l-r2)|o-/ '^V.T. C3 . . .). Then the index numbers of the separate commodity prices at tlie third date, taking the prices at the first date as standard, are 100'^^ 100-^ lOO'i . . . «! b, c, Hence the geometric mean of these n index numbers together V \ Oi h Ci = 100gr3/j7„ where g^, g^ denote the geometric means of the u prices at the two dates. It follows that the ratio index number of prices at 3rd date with prices at 1st date as standard index number of prices at 2nd date Avith prices at 1st date as standard _1Q06^3M lOOgJg, =93/92- It is therefore quite independent of the particular date chosen as standard. 4. The Mean of Combined Sets oi Observations. (I) Suppose one variable x is expressed as the sum of a number of other variables, thus x=a-\-b-\-c-\- . . ., and suppose that we have n different values of the variables, giving equations of the type a;2=a2+^2+C2+ • • • a^n=an+^« + C„ + 266 STATISTICS Hence, by addition, ^-l + '^•2+ • • +-^'„ = K+ • • +«J+(6i+ . . +6n)+(Ci+ . . +cj+ . . so that nx = nd-\-nS-\-nc-\- . . . x—d-[-b-\-c-{- . . ., where x, a, b . . . denote the means of the n values of the respec- tive variables. Thus the mean of a sum equals the sum of the means, and, if some of the positive signs in (a+6+c+ . . .) are made negative, there will evidently be a corresponding change of sign in (a+^+ • • •)• Example. — Suppose 100 family budgets are collected and the items in each are separated under five heads — rent, food, clothes, coals and light, sundries. The expenditure, x, in each budget would thus be expressed as the sum of five variables, a, b, c, d, e, and the mean of the 100 dififerent x's ^A'ould equal the sum of the means of the a's, the 6's, the c's, the rf's, and the e's. 
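The rule that the mean of a sum equals the sum of the means may be verified on a small scale in Python. The three budgets below are invented purely for illustration (they are not data from the text), and only three are used instead of 100 to keep the sketch short.

```python
import math

# Hypothetical weekly budgets in shillings, invented for illustration only.
budgets = [
    {"rent": 6, "food": 20, "clothes": 5, "coals_and_light": 3, "sundries": 4},
    {"rent": 7, "food": 24, "clothes": 4, "coals_and_light": 2, "sundries": 6},
    {"rent": 5, "food": 18, "clothes": 6, "coals_and_light": 4, "sundries": 3},
]

heads = list(budgets[0])
n = len(budgets)

# Mean of the total expenditures x = a + b + c + d + e.
mean_of_totals = sum(sum(budget.values()) for budget in budgets) / n

# Sum of the means of the separate heads.
sum_of_means = sum(sum(budget[head] for budget in budgets) / n for head in heads)

print(mean_of_totals, sum_of_means)
```

The two figures agree apart from rounding in the last binary place, which is why the comparison is best made with a tolerance rather than by exact equality.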
(2) Sets of observations are made which differ in locality or time or some other respect. To find the resultant mean.

Let $l$ observations of the variable $x$ refer, say, to one date, $m$ observations to a second date, $n$ observations to a third, and so on, and let the means of these successive groups of observations be $\bar x_l$, $\bar x_m$, $\bar x_n,\ldots$, so that we may write
$$\bar x_l=\Sigma x_l/l,\qquad \bar x_m=\Sigma x_m/m,\qquad \bar x_n=\Sigma x_n/n,\ \ldots$$
If then $\bar x$ be the resultant mean, we have
$$\bar x=\frac{\Sigma x_l+\Sigma x_m+\ldots}{l+m+\ldots}=\frac{l\bar x_l+m\bar x_m+\ldots}{l+m+\ldots}.$$

Example. If the school children in the different schools of a county are weighed, $l$ children in one school, $m$ in another, $n$ in another, and so on, giving mean weights $\bar x_l$, $\bar x_m$, $\bar x_n\ldots$, the resultant mean weight for the children in all the schools combined is then given by the above expression.

5. Mean and Standard Deviation of a Distribution of Variables. Let $x_1, x_2, x_3\ldots x_n$ denote the deviations of each value, or group mid-value, of the observed organ or character when measured from some fixed value, and let $f_1, f_2, f_3\ldots f_n$ denote the observed frequencies of these respective deviations.

The arithmetic mean of the variables is thus given by
$$\bar x=(f_1x_1+f_2x_2+\ldots+f_nx_n)/(f_1+f_2+\ldots+f_n),$$
referred to the fixed value as origin.

We may conveniently represent the deviations $x_1, x_2, x_3\ldots$ by lengths measured from an arbitrary origin $O$ along a straight line, in which case the point $O$ defines the position of the fixed value from which the variables are measured. Let $P$ mark the position corresponding to a typical variable and let $G$ mark the position corresponding to the mean, $\bar x$. Thus $OP=x$, $OG=\bar x$, and if we denote the distance of $P$ from $G$ by $\xi$, we have $x=\bar x+\xi$. Hence
$$\bar x=\frac{f_1x_1+f_2x_2+\ldots+f_nx_n}{f_1+f_2+\ldots+f_n}=\frac{f_1(\bar x+\xi_1)+f_2(\bar x+\xi_2)+\ldots+f_n(\bar x+\xi_n)}{f_1+f_2+\ldots+f_n}=\frac{\bar x(f_1+f_2+\ldots+f_n)+(f_1\xi_1+f_2\xi_2+\ldots+f_n\xi_n)}{f_1+f_2+\ldots+f_n}.$$
Therefore
$$f_1\xi_1+f_2\xi_2+\ldots+f_n\xi_n=0\qquad\ldots\ (1)$$
The expression $(f_1x_1+f_2x_2+\ldots+f_nx_n)$ is called the first moment of the distribution referred to $O$ as origin.
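Result (1) is easy to verify on a small frequency distribution. The sketch below uses invented deviations and frequencies (the names `deviations`, `frequencies` and the figures themselves are assumptions, not data from the text) and computes the first moment about the arbitrary origin and about the mean.

```python
# Hypothetical frequency distribution; the figures are invented for illustration.
deviations = [-2, -1, 0, 1, 2, 3]      # x_i, measured from an arbitrary origin O
frequencies = [3, 8, 15, 12, 6, 1]     # f_i

N = sum(frequencies)

# First moment referred to O, and the mean it gives.
first_moment_about_O = sum(f * x for f, x in zip(frequencies, deviations))
x_bar = first_moment_about_O / N

# First moment referred to G (the mean): sum of f_i * xi_i with xi_i = x_i - x_bar.
first_moment_about_G = sum(f * (x - x_bar) for f, x in zip(frequencies, deviations))

print(f"N = {N}, first moment about O = {first_moment_about_O}, mean = {x_bar:.5f}")
print(f"first moment about G = {first_moment_about_G:.2e}")
```

The second figure is zero apart from floating-point rounding, which is all that equation (1) asserts.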
We conclude that when the distribution is referred to $G$ as origin, i.e. when deviations are measured from the mean of the distribution, the first moment vanishes.

Frequency Distribution Table.

  (1)                       (2)            (3)                      (4)
  Deviations of Variables   Frequency of   Product of Nos. in       Product of Nos. in
  from some fixed value.    Deviations.    Col. (1) and Col. (2).   Col. (1) and Col. (3).
  $x_1$                     $f_1$          $f_1x_1$                 $f_1x_1^2$
  $x_2$                     $f_2$          $f_2x_2$                 $f_2x_2^2$
  . . .                     . . .          . . .                    . . .
  $x_n$                     $f_n$          $f_nx_n$                 $f_nx_n^2$
  Totals                    $N$            $N'_1$                   $N'_2$

In the notation of the above table, where the dashes are omitted in $N_1$, $N_2$ when the mean is origin, we have $\bar x=N'_1/N$ and $N_1=0$.

Again, the root-mean-square deviation, $s$, measured from the arbitrary origin $O$, is given by
$$s^2=(f_1x_1^2+f_2x_2^2+\ldots+f_nx_n^2)/(f_1+f_2+\ldots+f_n)=N'_2/N,$$
and $N'_2$ is called the second moment of the distribution referred to $O$ as origin. Substituting as before we have
$$s^2=\frac{f_1(\bar x+\xi_1)^2+\ldots+f_n(\bar x+\xi_n)^2}{f_1+\ldots+f_n}=\frac{\bar x^2(f_1+\ldots+f_n)+2\bar x(f_1\xi_1+\ldots+f_n\xi_n)+(f_1\xi_1^2+\ldots+f_n\xi_n^2)}{f_1+\ldots+f_n}=\bar x^2+\frac{f_1\xi_1^2+\ldots+f_n\xi_n^2}{f_1+\ldots+f_n},$$
since $f_1\xi_1+\ldots+f_n\xi_n=0$. Hence
$$s^2=\bar x^2+\sigma^2\qquad\ldots\ (2)$$
where $\sigma$ is the root-mean-square deviation measured from $G$ as origin, or the standard deviation as it is called.

From this result it is clear that $\sigma$ is never greater than $s$ (and is less unless $\bar x=0$), or the root-mean-square deviation is least when measured from the arithmetic mean.

Generally, if we write
$$\nu'_k=(f_1x_1^k+\ldots+f_nx_n^k)/(f_1+\ldots+f_n),\qquad \nu_k=(f_1\xi_1^k+\ldots+f_n\xi_n^k)/(f_1+\ldots+f_n),$$
where $\Sigma(fx^k)$ and $\Sigma(f\xi^k)$ may be called the $k$th moments referred to $O$ and to the mean as origins respectively, so that $\nu_1=0$, $\nu_2=\sigma^2$, $\nu'_2=s^2$, we have
$$\nu'_k=\frac{f_1(\xi_1+\bar x)^k+\ldots+f_n(\xi_n+\bar x)^k}{f_1+\ldots+f_n}=\nu_k+k\bar x\,\nu_{k-1}+\frac{k(k-1)}{2!}\bar x^2\nu_{k-2}+\ldots+\bar x^k\nu_0.$$
For example, when $k=2$, since $\nu_0=1$ and $\nu_1=0$,
$$\nu_2=\nu'_2-\bar x^2\qquad\ldots\ (2)\ \text{bis}$$
Again, when $k=3$,
$$\nu_3=\nu'_3-3\bar x\nu_2-\bar x^3\qquad\ldots\ (3)$$
and, when $k=4$,
$$\nu_4=\nu'_4-4\bar x\nu_3-6\bar x^2\nu_2-\bar x^4\qquad\ldots\ (4)$$

There are interesting statical analogues to the above results concerning the mean and standard deviation.

Let us imagine a set of weights, $f_1, f_2, f_3\ldots$ suspended at $P_1, P_2, P_3\ldots$
from a straight horizontal bar, and let the distance of any typical weight $f$ from some arbitrary origin $O$ on the bar be $x$. Then the first moment,
$$f_1x_1+f_2x_2+\ldots+f_nx_n$$
(where some of the $x$'s may be negative corresponding to weights suspended to the left of $O$), measures the total turning effect of all the given weights about $O$, and if we further imagine all these weights replaced by a single weight equal to their sum $(f_1+f_2+\ldots+f_n)$, then, in order to produce the same turning effect, it would have to be placed at a point $G$, the distance of which from $O$ is given by $\bar x$, where
$$(f_1+f_2+\ldots+f_n)\bar x=f_1x_1+f_2x_2+\ldots+f_nx_n.$$
Thus
$$\bar x=(f_1x_1+f_2x_2+\ldots+f_nx_n)/(f_1+f_2+\ldots+f_n),$$
and, statically, this defines the position of the centre of gravity of the given weights, $f_1, f_2,\ldots f_n$, relative to $O$. As before,
$$\bar x=\Sigma f(\bar x+\xi)/\Sigma f=\bar x+\Sigma(f\xi)/\Sigma f;$$
hence
$$f_1\xi_1+f_2\xi_2+\ldots+f_n\xi_n=0,$$
and, statically, this means that the turning effect of $f_1, f_2\ldots f_n$ about $G$ is zero, in other words, the bar would balance freely about $G$.

Again, the second moment,
$$f_1x_1^2+f_2x_2^2+\ldots+f_nx_n^2,$$
measures the moment of inertia of the weights $f_1, f_2\ldots f_n$ about $O$, and, if we imagine these different weights replaced by a single weight $(f_1+f_2+\ldots+f_n)$ as before, the moment of inertia will be unaltered if the latter be located at a distance $s$ from $O$, where
$$(f_1+f_2+\ldots+f_n)s^2=(f_1x_1^2+f_2x_2^2+\ldots+f_nx_n^2);$$
therefore
$$s^2=(f_1x_1^2+\ldots+f_nx_n^2)/(f_1+\ldots+f_n)=\bar x^2+\sigma^2,$$
as before, and the interpretation of this is that the square of the radius of gyration of the system of weights about $O$ equals the square of the radius of gyration about $G$, the centre of gravity of the system, together with the square of the distance of $G$ from $O$. Also, $s$ is clearly least when it is measured from $G$.

6. The Mean Deviation a Minimum when measured from the Median. Consider first the case when only two different values of the variable are observed, $X_1$, $X_2$, and let their deviations from an arbitrary value, $O$, chosen as origin, be respectively $x_1$, $x_2$.
If $f_1$, $f_2$ be the observed frequencies of these values, the sum of their deviations from $O$ is
$$f_1x_1+f_2x_2,$$
which is clearly less when the value $O$ lies between $X_1$, $X_2$ than when it is smaller or greater than both of them. Choosing $O$, therefore, between $X_1$ and $X_2$, and taking $X_1$ to be the value with the greater frequency, we write the deviation sum
$$=f_2(x_1+x_2)+x_1(f_1-f_2)=f_2x+x_1(f_1-f_2),$$
where $x$ is the deviation of either of the values $X_1$, $X_2$ from the other, and $(f_1-f_2)$ is positive since $f_1>f_2$. Now this is evidently least when $(f_1-f_2)x_1$ vanishes, i.e. when (1) $x_1=0$, in which case $O$ coincides with $X_1$, the more frequent of the two variables, or, when (2) $f_1=f_2$, and in this case, when the two observed values occur equally often, the deviation sum is constant for any origin between $X_1$ and $X_2$.

When several different values of the variable are observed, they may be arranged in order of magnitude, $X_1, X_2, X_3\ldots X_n$, from the least to the greatest, with frequencies $f_1, f_2, f_3\ldots f_n$.

If $f_1>f_n$ we pair off $f_n$ of the $X_n$'s with $f_n$ of the $X_1$'s; the deviation sum for this pair is least and remains constant when measured from any origin between $X_1$ and $X_n$. We next pair off some or all of the $X_1$'s which remain against an equal number of $X_{n-1}$'s and the deviation sum for this pair is least and remains constant when measured from any origin between $X_1$ and $X_{n-1}$. If some $X_1$'s still remain, we pair them off so far as we can against an equal number of $X_{n-2}$'s but, if it be $X_{n-1}$'s that remain, we pair them off against an equal number of $X_2$'s.

This process can evidently be continued until ultimately we reach the origin from which the mean deviation of the whole distribution is a minimum, for if any $X$ be left unpaired the origin will coincide with that $X$. Otherwise, the deviation is least when measured from any value between the last two $X$'s paired off together, and within that range it is constant.
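The outcome of the pairing argument can be tested numerically. In this minimal sketch the observed values and frequencies are invented, and the names `values`, `freqs` and `deviation_sum` are assumptions made for illustration; the deviation sum is evaluated at the median, at the arithmetic mean, and over a grid of other origins.

```python
from statistics import median

# Hypothetical observations; the figures are invented for illustration.
values = [1, 2, 4, 7, 11]   # X_1 ... X_n in order of magnitude
freqs = [2, 5, 3, 1, 1]     # f_1 ... f_n


def deviation_sum(origin):
    """Sum of deviations from `origin`, taken without regard to sign."""
    return sum(f * abs(x - origin) for x, f in zip(values, freqs))


# Write every observation out with its due frequency and take the middle value.
expanded = [x for x, f in zip(values, freqs) for _ in range(f)]
med = median(expanded)
arithmetic_mean = sum(expanded) / len(expanded)

print(f"median = {med}, deviation sum = {deviation_sum(med)}")
print(f"mean   = {arithmetic_mean}, deviation sum = {deviation_sum(arithmetic_mean)}")

# Scan origins from 0 to 12 in steps of 0.1; none should beat the median.
grid = [k / 10 for k in range(0, 121)]
best_on_grid = min(deviation_sum(origin) for origin in grid)
print(f"smallest deviation sum on the grid = {best_on_grid}")
```

With these figures the deviation sum is 22 about the median and 25 about the mean, and no origin on the grid gives less than 22.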
Since, by definition, the median is the value of the variable half-way along the series of given observations, ranged in order of their magnitude and assigning each its due weight or frequency, it is clearly such that a balance can be effected by pairing off the values on either side of it against one another in the manner explained above; it therefore follows that the mean deviation of a frequency distribution is a minimum when the deviations are measured from the median.

The statical analogy to the median also is worth noting. With the same notation as before, the moment or turning effect of two forces, $f_1$, $f_2$, about $O$ is
$$f_1x_1+f_2x_2.$$
But in this case, if $O$ be taken at some point in between $X_1$ and $X_2$, since the mean deviation sums the separate deviations without regard to sign, we must imagine $f_1$ reversed so as to produce a turning effect in the same direction as before. The moment will then be still $(f_1x_1+f_2x_2)$, and it is less when $O$ occupies such a position than when it is on $X_1X_2$ produced in either direction. Taking $O$, therefore, somewhere in between $X_1$ and $X_2$, the moment may be written
$$=f_2(x_1+x_2)+x_1(f_1-f_2);$$
and, if $f_1>f_2$, this is least when $x_1$ vanishes, that is, when $O$ coincides with $X_1$, but if $f_1=f_2$, the two forces constitute a couple, and the moment is the same whatever position $O$ occupies between $X_1$ and $X_2$.

7. The Method of Least Squares. To the student who is unacquainted with the differential calculus, the following descriptive argument, the basis of the principle of least squares, for determining the values of $m$ and $c$ which make
$$(mx_1+c-y_1)^2+(mx_2+c-y_2)^2+\ldots+(mx_n+c-y_n)^2\qquad\ldots\ (1)$$
a minimum, may prove instructive.
Let us call the above expression $E$ and let us suppose that different values are given to $m$ while $c$ remains unchanged; in that case $E$ will vary with $m$, and we might imagine the different values obtained for $E$ plotted against the corresponding values of $m$ giving a curve of some type.

Such a curve may rise and fall in wave-like fashion as in the figure, resulting in maximum points like $A$ and $C$, and minimum points like $B$, where we define a maximum point to be such that, as we move away from it along the curve, whether to left or right, the size of the ordinate (and therefore the value of $E$) decreases; likewise, a minimum point is such that, as we move away from it, the ordinate (and therefore also $E$) increases. In the neighbourhood of such points it is clear that the size of the ordinate, such as $Aa$ or $Bb$, changes so slowly as to be practically stationary.

Suppose then that $m$ and $(m+\mu)$, $\mu$ being very small, are two values of $m$ respectively at and near a minimum position on the curve, i.e. a position like $B$ corresponding to a minimum value for $E$. Since $E$ near such a point does not differ appreciably from $E$ at such a point, we may practically equate the two expressions obtained for $E$ by substituting $(m+\mu)$ and $m$ respectively for $m$ in (1), thus
$$\{(m+\mu)x_1+c-y_1\}^2+\{(m+\mu)x_2+c-y_2\}^2+\ldots=(mx_1+c-y_1)^2+(mx_2+c-y_2)^2+\ldots,$$
$$[(mx_1+c-y_1)^2+2\mu x_1(mx_1+c-y_1)+\mu^2x_1^2]+\ldots=(mx_1+c-y_1)^2+\ldots$$
Thus
$$[2x_1(mx_1+c-y_1)+\mu x_1^2]+\ldots=0.$$
Now, the smaller we take $\mu$, the nearer to the truth does this result become. Hence, by making $\mu$ tend to zero, we are led to the strictly true relation
$$x_1(mx_1+c-y_1)+\ldots=0.$$
This is one of the equations in the text. To obtain the second, we keep $m$ constant and vary $c$.
Suppose $c$ and $(c+\gamma)$ are two values of $c$ at and near a minimum position on the curve; then, equating the two corresponding values of $E$, we have as before
$$(mx_1+c+\gamma-y_1)^2+\ldots=(mx_1+c-y_1)^2+\ldots,$$
$$[(mx_1+c-y_1)^2+2\gamma(mx_1+c-y_1)+\gamma^2]+\ldots=(mx_1+c-y_1)^2+\ldots$$
Thus
$$[2(mx_1+c-y_1)+\gamma]+\ldots=0,$$
and, proceeding to the limit when $\gamma$ tends to zero, we reach the other equation in the text, namely,
$$(mx_1+c-y_1)+\ldots=0.$$

[The Method of Least Squares came first into prominence in Astronomy in connection with the determination of the best value to take when a number of observations, apparently equally reliable, give results not quite in agreement. If, for instance, $x$ be the true value of some variable, and if $x_1, x_2, x_3\ldots x_n$ be the results of $n$ observations, the method of least squares assumes $x$ to be given by making
$$u=(x-x_1)^2+(x-x_2)^2+\ldots+(x-x_n)^2$$
a minimum. Now
$$\frac{du}{dx}=2(x-x_1)+2(x-x_2)+\ldots+2(x-x_n),$$
and this vanishes when $(x-x_1)+(x-x_2)+\ldots+(x-x_n)=0$, i.e. $x=(x_1+x_2+\ldots+x_n)/n$, so that in this case we are led to the ordinary arithmetic mean of the $n$ observations as the best value. The method was used by Gauss as early as 1795.]

8. To prove
$$\int_{-\infty}^{+\infty}e^{-x^2}dx=\sqrt{\pi}.$$
Let
$$I=\int_{-\infty}^{+\infty}e^{-x^2}dx;$$
thus, also,
$$I=\int_{-\infty}^{+\infty}e^{-y^2}dy;$$
therefore,
$$I^2=\int_{-\infty}^{+\infty}e^{-x^2}dx\int_{-\infty}^{+\infty}e^{-y^2}dy=\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}e^{-(x^2+y^2)}dx\,dy=\int_{r=0}^{\infty}\int_{\theta=0}^{2\pi}e^{-r^2}r\,dr\,d\theta$$
(by changing to polar co-ordinates)
$$=\int_0^\infty e^{-r^2}r\,dr\int_0^{2\pi}d\theta=(\tfrac12)(2\pi).$$
Hence
$$I=\int_{-\infty}^{+\infty}e^{-x^2}dx=\sqrt{\pi}.$$

9. To prove:
(1) $\Gamma(n+1)=n\Gamma(n)$.
(2) $B(m,n)=\dfrac{\Gamma(m)\Gamma(n)}{\Gamma(m+n)}$.

(1)
$$\Gamma(n+1)=\int_0^\infty x^ne^{-x}dx=-\int_0^\infty x^n\,d(e^{-x})=\left[-x^ne^{-x}\right]_0^\infty+n\int_0^\infty x^{n-1}e^{-x}dx=n\Gamma(n),$$
because the expression in square brackets vanishes at both limits.

(2)
$$\Gamma(m)\Gamma(n)=\int_0^\infty e^{-\xi}\xi^{m-1}d\xi\int_0^\infty e^{-\eta}\eta^{n-1}d\eta=\int_0^\infty e^{-x^2}x^{2m-2}\,2x\,dx\int_0^\infty e^{-y^2}y^{2n-2}\,2y\,dy,$$
where $x^2=\xi$, $y^2=\eta$. Hence
$$\Gamma(m)\Gamma(n)=4\int_0^\infty\int_0^\infty e^{-(x^2+y^2)}x^{2m-1}y^{2n-1}dx\,dy=4\int_{r=0}^{\infty}\int_{\theta=0}^{\pi/2}e^{-r^2}r^{2m+2n-2}\cos^{2m-1}\theta\,\sin^{2n-1}\theta\;r\,dr\,d\theta$$
(by changing to polar co-ordinates). Thus $\Gamma(m)\Gamma(n)=$
$$4\int_0^\infty e^{-r^2}r^{2m+2n-1}dr\int_0^{\pi/2}\cos^{2m-1}\theta\,\sin^{2n-1}\theta\,d\theta=\int_0^\infty e^{-\rho}\rho^{m+n-1}d\rho\int_0^1t^{n-1}(1-t)^{m-1}dt,$$
where $\rho=r^2$ and $t=\sin^2\theta$; therefore,
$$\Gamma(m)\Gamma(n)=\Gamma(m+n)B(n,m)=\Gamma(m+n)B(m,n)$$
by symmetry.

10. Elementary Method of Testing the Probability Integral Table. The reader may find more satisfaction in using the probability integral table if he tests for himself one or two of its results by means of squared paper or in some other way. We have seen that the probability of an error between $0$ and $\sigma\xi$ is given by the expression
$$\frac{1}{\sqrt{2\pi}}\int_0^{\xi}e^{-\xi^2/2}\,d\xi.$$
Put $\xi=\sqrt2\,x$, and this becomes
$$\frac{1}{\sqrt{\pi}}\int_0^xe^{-x^2}dx=\int_0^xe^{-x^2}dx\Big/\int_{-\infty}^{+\infty}e^{-x^2}dx,\ \text{by Note (8)},$$
$$=\text{area }OBPN/\text{area }A'BA,\ \text{in the figure}.$$
Now the graph of $y=e^{-x^2}$ is drawn in fig. (40) of the text, and it is possible therefore to get an approximation to the above result for any value of $x$ by counting the number of small squares in that figure enclosed by the areas corresponding to $OBPN$ and $A'BA$ respectively. Each complete small square may be reckoned as 1, and each portion of a square may be reckoned as 1 if it exceeds half a square and as zero if it is less than half a square. This gives, for example,
$$\frac{1}{\sqrt{\pi}}\int_0^{0.25}e^{-x^2}dx=98/707=0.139,$$
whereas the tables give 0.138.

For a value like $x=0.71$, count the squares in the usual way between curve, axes, and ordinate $x=0.70$: then add to the result one-fifth of the number of squares in the small slice of area between curve, axis, and ordinates $x=0.70$ and $x=0.75$. We get
$$\frac{1}{\sqrt{\pi}}\int_0^{0.71}e^{-x^2}dx=240/707=0.339$$
as compared with 0.342 from the tables.

These results are not unsatisfactory considering the rough nature of the method followed to obtain them.

11. The Law of Frequency in the case of two Correlated Variables with certain Deductions therefrom [based on Professor Karl Pearson's memoir, Regression, Heredity and Panmixia (Phil. Trans., vol. 187A, pp. 253-318)].
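Consider two variables whose deviations, $x$ and $y$, from their respective means are due to a number of independent causes, the deviations in which from their means can be quantitatively denoted by $\epsilon_1, \epsilon_2,\ldots\epsilon_m$. We assume that each $\epsilon$ deviation is so small compared to the mean value from which it is measured that $x$ and $y$ can be sensibly expressed as linear functions, thus
$$x=a_1\epsilon_1+a_2\epsilon_2+\ldots+a_m\epsilon_m\qquad\ldots\ (1)$$
$$y=b_1\epsilon_1+b_2\epsilon_2+\ldots+b_m\epsilon_m\qquad\ldots\ (2)$$
(Some of the $a$'s and $b$'s may be zero, and if $x$ only involved, say, $\epsilon_1, \epsilon_2\ldots\epsilon_k$, and $y$ only involved $\epsilon_{k+1}\ldots\epsilon_m$, then it would be natural to expect no correlation between $x$ and $y$.)

We further assume that each $\epsilon$ varies according to the normal law with S.D. $\sigma$ with appropriate suffix.

Equations (1) and (2) show that the same $x$ and $y$ may arise in a multitude of different ways obtained by varying the $\epsilon$'s so that their weighted sums (the $a$'s and $b$'s being the weights) remain unaltered.

The probability that the particular deviations lying between $\epsilon_1$ and $(\epsilon_1+\delta\epsilon_1)$, $\epsilon_2$ and $(\epsilon_2+\delta\epsilon_2)$, $\ldots$ $\epsilon_m$ and $(\epsilon_m+\delta\epsilon_m)$ shall concur, since they are all independent, is
$$\left(\frac{1}{\sigma_1\sqrt{2\pi}}e^{-\epsilon_1^2/2\sigma_1^2}\delta\epsilon_1\right)\left(\frac{1}{\sigma_2\sqrt{2\pi}}e^{-\epsilon_2^2/2\sigma_2^2}\delta\epsilon_2\right)\cdots\left(\frac{1}{\sigma_m\sqrt{2\pi}}e^{-\epsilon_m^2/2\sigma_m^2}\delta\epsilon_m\right).$$
But, writing
$$a_3\epsilon_3+\ldots+a_m\epsilon_m=\alpha,\qquad b_3\epsilon_3+\ldots+b_m\epsilon_m=\beta,$$
equations (1) and (2) become
$$a_1\epsilon_1+a_2\epsilon_2+(\alpha-x)=0,\qquad b_1\epsilon_1+b_2\epsilon_2+(\beta-y)=0.$$
Therefore
$$\epsilon_1=\frac{b_2(x-\alpha)-a_2(y-\beta)}{a_1b_2-a_2b_1},\qquad\epsilon_2=\frac{a_1(y-\beta)-b_1(x-\alpha)}{a_1b_2-a_2b_1}.$$
And, for any function $v$,
$$\iint v\,dx\,dy=\iint v\left(\frac{\partial x}{\partial\epsilon_1}\frac{\partial y}{\partial\epsilon_2}-\frac{\partial x}{\partial\epsilon_2}\frac{\partial y}{\partial\epsilon_1}\right)d\epsilon_1\,d\epsilon_2=(a_1b_2-a_2b_1)\iint v\,d\epsilon_1\,d\epsilon_2.$$
Hence [...]

The total probability for deviations between $x$ and $(x+\delta x)$ and $y$ and $(y+\delta y)$ is obtained by integrating $z$ between limits $-\infty$ and $+\infty$ for all the $\epsilon$'s from $\epsilon_3$ to $\epsilon_m$, and it is not very difficult to see that this will ultimately lead to an expression of the form
$$C\,.\,\delta x\,\delta y\,.\,e^{-(ax^2+2hxy+by^2)}.$$
This is the required law of frequency.

To find the meanings of the constants $a$, $b$, $h$. The total probability for a deviation between $x$ and $(x+\delta x)$ associated with any deviation $y$ is
$$C\,\delta x\int_{-\infty}^{+\infty}e^{-(ax^2+2hxy+by^2)}dy=C\sqrt{\pi/b}\;\delta x\;e^{-(ab-h^2)x^2/b}.$$
But if $x$ be subject to the normal law, the probability for a deviation between $x$ and $(x+\delta x)$ is
$$\frac{\delta x}{\sqrt{2\pi}\,\sigma_x}e^{-x^2/2\sigma_x^2},$$
where $\sigma_x$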
is the S.D. of $x$ independent of $y$. Comparing these two results, we have
$$1/2\sigma_x^2=(ab-h^2)/b=a(1-r^2),\quad\text{if } r=-h/\sqrt{ab}.$$
Similarly,
$$1/2\sigma_y^2=(ab-h^2)/a=b(1-r^2),$$
so that
$$h=-r\sqrt{ab}=-r/2\sigma_x\sigma_y(1-r^2).$$
Again, we may integrate $z$ for all values of $x$ and $y$, and so get the total frequency, $N$, of the $(x, y)$ pair. Thus,
$$N=C\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}e^{-(ax^2+2hxy+by^2)}dx\,dy=C\sqrt{\pi/b}\int_{-\infty}^{+\infty}e^{-(ab-h^2)x^2/b}dx=C\sqrt{\pi/b}\,\sqrt{\pi b/(ab-h^2)}.$$
Hence
$$C=\frac{N}{\pi}\sqrt{ab-h^2}=\frac{N}{2\pi\sigma_x\sigma_y\sqrt{1-r^2}}.$$
Thus
$$z=C\,e^{-\frac{1}{2(1-r^2)}\left[\frac{x^2}{\sigma_x^2}-\frac{2rxy}{\sigma_x\sigma_y}+\frac{y^2}{\sigma_y^2}\right]}$$
[...]
$$(1-r^2)^{-n/2}\,e^{-\frac{n(1-r\kappa)}{1-r^2}},\quad\text{where }\kappa=\Sigma xy/n\sigma_x\sigma_y.$$
Now the probability of this particular distribution is greatest when
$$\tfrac12\log(1-r^2)+\frac{1-\kappa r}{1-r^2}$$
is least, and, differentiating with respect to $r$, this leads to
$$-\frac12\cdot\frac{2r}{1-r^2}+\frac{(1-r^2)(-\kappa)+2r(1-\kappa r)}{(1-r^2)^2}=0,$$
i.e. $-r(1-r^2)-\kappa(1-r^2)+2r(1-\kappa r)=0$,
i.e. $-r+r^3-\kappa+\kappa r^2+2r-2\kappa r^2=0$,
i.e. $(r-\kappa)(1+r^2)=0$.

It is not difficult to show that $r=\kappa$ gives a minimum; hence the required probability is a maximum and we get the best value for the coefficient by taking
$$r=\kappa=\Sigma xy/n\sigma_x\sigma_y.$$

CERTAIN CURRENT SOURCES OF SOCIAL STATISTICS

Any one who is anxious to get reliable figures bearing upon some social matter is somewhat at a pause unless he is thoroughly conversant with all the statistical ramifications of Government authorities, local and national, of trade unions, friendly societies, and hosts of other bodies of a public or semi-public character.

While recognizing the lavish outpouring of statistics of all kinds upon a multitude of diverse topics every year, and appreciating the immense care and patience shown by those who are responsible for their collection and preparation, one cannot but deplore the lack of any co-ordinating principle in general between one body and another either in deciding what statistics shall be collected, by whom and when they shall be collected, or how afterwards they shall be tabulated and presented to the public.
Too often a narrow-minded jealousy prevents one authority from consulting with another, and such co-operation as does exist is due largely to the efforts of able and enlightened individuals. The result is that a vast amount of labour and expense goes waste and the loss to the public is incalculable, but the public do not care, and they do not care because they do not know.

At present, to quote from an influential petition on the subject recently presented to His Majesty's Government, ' It is almost universally the case that any serious investigation is reduced to roughly approximate estimates in relation to some factor which is essential for its result. ... It is not too much to say that there is hardly any reform, financial, social, or commercial, for which adequate information can be provided with our present machinery.' But this state of things would be partly remedied by adequate control such as might be secured by the establishment of a central statistical office with a minister in charge who should be responsible for unification so far as possible in the collection, tabulation, and issue of all public statistics.

It is scarcely possible for a single private individual to make a quantitative investigation of any social question on a large enough scale to produce results of real value; conspicuous instances like Booth and Rowntree might seem to be exceptions to this rule, but even they had a number of workers acting under their direction, without whose aid their task would have seemed almost hopeless.

For such statistics as we have we are therefore dependent upon Government departments, local authorities, public officials, trade associations representing employers or labour, public companies, and so on.

The reader who wishes to get some idea of the extent and the limitations of official British statistics is referred to the admirable introductory chapters of Bowley's Elements of Statistics.
Here we cannot do more than mention a very few of the most important sources whence such statistics are derived.

The most voluminous of all our records is probably the Census of the Population which is taken every ten years. Its scope is but faintly realized by enumerating the chief subjects on which the Registrar-General asked information from each householder in 1911, namely:
(1) Numbers and Geographical Distribution of the Population.
(2) Nationality and Birth-place.
(3) Numbers at Different Ages, Male and Female.
(4) Numbers Single, Married, and Widowed.
(5) Sizes of Families, including Children Dead.
(6) Numbers engaged in different Professions and Occupations.
(7) Numbers Blind, Deaf, Dumb, not in their Right Mind.
(8) Numbers occupying Dwellings of Different Sizes as measured by the Number of Rooms.

This may seem an ambitious scheme when it is stated that the mere enumeration of the people was successfully opposed less than two hundred years ago as ' subversive of the last remains of English liberty and likely to result in some public misfortune or an epidemical disorder,' and the first census was only taken in 1801. [See Article in the Encyclopaedia Britannica on the subject.]

The results of each census are published in bulky volumes as soon as they can be reduced and tabulated, a process which, of course, takes a considerable time even for an army of workers with calculating machines and every modern device to facilitate their progress. It is to be regretted that more is not done to advertise so valuable a record of work by publication in a cheap and attractive form of a summary of matters which vitally affect the good of the commonwealth. As it is, the census volumes tend to be purchased only by public authorities and officials who require to use them occasionally as books of reference.
Neglect of the blandishments of advertisement (to be commended in general because such neglect is somehow associated with the presentation of all truth) may be perhaps carried too far in the issue of statistics.

It will be noted that in the periodical census no mention is made of wages though the people are classified as regards occupation, and for information upon this point we must turn to another source. The last general census of wages was taken in 1906, following and improving upon an earlier inquiry twenty years before, but, in connection with an inquiry by the Board of Trade into the cost of living of the working classes, information was collected as to rates of wages in 1912 of workpeople in certain occupations in the building, engineering, and printing trades, these being selected as industries common to most towns, and because the time rates of wages paid in them are largely standardized.

The 1906 inquiry into earnings and hours of labour, unlike the decennial census, was conducted on a voluntary basis and was never wholly completed. In brief it set out to discover from employers:
(1) The Numbers of Working-people Employed in Various Occupations, distinguishing Men, Women, Lads, and Girls.
(2) The Nature of the Work done and the Rates of Wages Paid, distinguishing Time Rates from Piece Rates.
(3) The Hours Worked, distinguishing Under- or Over-time from Normal Time.

The ground actually covered by the inquiry embraces the following trades: Textiles, Clothing, Building and Woodworking, Public Utility Services, Metal, Engineering, and Shipbuilding, in 1906; also Agriculture, and Railway Service, in 1907; the reports upon these trades were published separately at different dates between 1909 and 1912, and the following trades were bulked together in one volume, published in 1913: Paper and Printing; Pottery, Brick, Glass, and Chemicals; Food, Drink, and Tobacco; and Miscellaneous Trades.
The Cost of Living Inquiry of 1912 was in continuation of a similar inquiry in 1905, which in addition compared conditions in the United Kingdom and certain foreign countries. It dealt not only with wages but also with rents and retail prices.

The report states that ' particulars as to the rent and accommodation of typical working-class dwellings were obtained from officials of local authorities, surveyors of taxes, house owners and agents, and by house-to-house inquiry.' Also ' returns of the prices most generally paid by working-class customers for a number of specified commodities were obtained in each town by personal inquiry from a number of retailers engaged in working-class trade.'

Since then Lord Sumner's Committee and a Committee of the Agricultural Wages Board have examined the change in the cost of living between 1914 and 1919, as evidenced by a number of household budgets collected from among urban working-classes and workers in rural districts respectively.

One other highly important inquiry carried out by the Board of Trade deserves notice, namely, the First Census of Production of the United Kingdom (1907). The published report shows:
(1) The total Net Output in Money Value for each Trade Group in each Industry.
(2) The Number of Persons Employed in each Trade Group (salaried persons and wage-earners exclusive of outworkers).
(3) The Net Output per Person Employed in each Trade Group as deduced from (1) and (2).
(4) The Horse-power of Engines in Mines, Quarries, or Factories Employed in each Trade Group.

It is explained that the term ' net output ' here represents the value of the aggregate output of the factories, etc., from which returns were received in each trade group, after deducting the cost of materials purchased from factories, etc., not included in the group, or supplied by merchants or others not making returns to the Census of Production Office.
Valuable as the results of these inquiries undoubtedly are, they would be of still more value were it only possible satisfactorily to collate the various returns of population, wages, and production. No record of wages was included, for example, in the Census of Production statistics, and it is quite impossible to deduce the number of wage-earners and those dependent upon them in any trade at any given time.

Apart, however, from such special inquiries as we have instanced, and the ten-yearly census of the people, there are other periodical records issued which provide us with valuable information.

The Ministry of Labour, until recently a special branch of the Board of Trade, charged with the duty of keeping in touch with labour conditions, issues each month a Labour Gazette giving particulars relating to the state of employment in the principal trades in the United Kingdom based on returns from employers, trade unions, and employment exchanges, besides information concerning trade disputes, changes in wages and hours, the course of prices, railway traffic receipts, foreign trade, etc. The Board of Trade also publishes weekly a Journal and Commercial Gazette dealing with matters of interest to all who are engaged in commerce or finance; while a Monthly Bulletin of Statistics of production, trade, finance, employment, etc., at present issued under the name of the Supreme Economic Council, is an important recent addition to our knowledge of international statistics.

Again the Registrar-General makes a quarterly return and annual summary of births, marriages, and deaths in the different counties of England and Wales, and of births, deaths, and infectious diseases in certain large towns.

In each public health area the medical officer reports periodically upon the hygienic condition of the district and the health of the people under his care.
The Board of Education is answerable for conditions in the schools, and the Home Office in factories and prisons; they report from time to time. The Ministry of Health similarly issues returns relating to pauperism and to housing, while the Board of Agriculture and Fisheries registers the acreage under crops and the number of live stock in the United Kingdom, and the Commissioners of Customs record the expansion or contraction of foreign trade.

In addition we have the endless accounts and statistics supplied, some voluntarily and some compulsorily, by municipal bodies, public companies, banks, trade associations, co-operative societies, insurance companies, trade unions, etc.

And yet, in spite of all this wealth of statistics, some surprising gaps occur, as we have already seen, in important particulars which cannot be traced. We shall quote only one more instance of such a hiatus: the income-tax returns provide a basis for measuring that part of the national income which is subject to taxation, some idea also can be formed of what the wage-earners receive, but as to the earnings of the portion of the community falling in between these two classes we are entirely ignorant. It is possible that war conditions during the years 1914-19 may have vastly increased the knowledge of the Government as to some matters such as internal resources and inland trade, of which little was known before, but, if so, the public, whom it concerns so closely, have not yet been permitted fully to share in this advantage.

For an excellent summary of labour statistics compiled or collected by the Government the reader is recommended to consult the Annual Abstract of Labour Statistics of the United Kingdom, published in the past by the Labour Department of the Board of Trade.

Note. A most useful Guide to Official Statistics is now issued by H.M.S.O. Dr. Bowley's Official Statistics will repay careful study in conjunction with it.
A NOTE ON TABLES TO AID CALCULATION

The short tables which follow are only inserted as specimens, as it is expected that the reader who wishes to make extensive use of such tables will have access to the fuller ones to which reference is made below.

Probability Integral Table, giving area of curve $z=\frac{1}{\sqrt{2\pi}}e^{-\frac12\xi^2}$ in terms of corresponding abscissa, see fig. (55):

  $\xi$   $\frac12(1+\alpha)$   $\alpha$     |   $\xi$   $\frac12(1+\alpha)$   $\alpha$
  0.00    .50000    .00000   |   0.76    .77637    .55274
  0.10    .53983    .07966   |   0.77    .77935    .55870
  0.20    .57926    .15852   |   0.78    .78230    .56460
  0.30    .61791    .23582   |   0.79    .78524    .57048
  0.40    .65542    .31084   |   0.80    .78814    .57628
  0.45    .67364    .34728   |   0.85    .80234    .60468
  0.50    .69146    .38292   |   0.90    .81594    .63188
  0.55    .70884    .41768   |   0.95    .82894    .65788
  0.60    .72575    .45150   |   1.00    .84134    .68268
  0.65    .74215    .48430   |   1.05    .85314    .70628
  0.70    .75804    .51608   |   1.10    .86433    .72866
  0.71    .76115    .52230   |   1.50    .93319    .86638
  0.72    .76424    .52848   |   2.00    .97725    .95450
  0.73    .76730    .53460   |   2.50    .99379    .98758
  0.74    .77035    .54070   |   3.00    .99865    .99730
  0.75    .77337    .54674   |   3.50    .99977    .99954

Fig. (56), the result of plotting $\alpha$ against $\xi$, enables us to estimate the probability of an error lying between any two limits.
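Entries of the specimen table can be recomputed with the error function. This is a minimal sketch, assuming Python's standard `math.erf`; the function names `alpha`, `half_one_plus_alpha` and `prob_between` are chosen here for illustration and the abscissa is taken in units of the standard deviation, as in the table.

```python
import math


def alpha(xi):
    """Area of the normal curve between -xi and +xi (abscissa in S.D. units)."""
    return math.erf(xi / math.sqrt(2.0))


def half_one_plus_alpha(xi):
    """Area of the normal curve from -infinity up to xi, i.e. (1 + alpha) / 2."""
    return 0.5 * (1.0 + alpha(xi))


def prob_between(lower, upper):
    """Probability of an error lying between two limits given in S.D. units."""
    return half_one_plus_alpha(upper) - half_one_plus_alpha(lower)


# Recompute a few rows of the table for comparison with the printed figures.
for xi in (0.50, 0.71, 1.00, 2.00, 3.00):
    print(f"{xi:4.2f}  {half_one_plus_alpha(xi):.5f}  {alpha(xi):.5f}")

# Probability of an error of one standard deviation or more in excess.
upper_tail = 1.0 - half_one_plus_alpha(1.0)
print(f"upper tail beyond 1 S.D. = {upper_tail:.5f}")
```

The same three functions give the probability between any two limits without interpolating in the table or reading from the graph.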
Table giving $P$, to test ' goodness of fit,' corresponding to certain values of $n'$ and $\chi^2$:

  $n'$   $\chi^2=4$    5       6       7       8       9       10      11      12      13      14      15
   7     .67668  .54381  .42319  .32085  .23810  .17358  .12465  .08838  .06197  .04304  .02964  .02026
   8     .77978  .65996  .53975  .42888  .33259  .25266  .18857  .13862  .10056  .07211  .05118  .03600
   9     .85712  .75758  .64723  .53663  .43347  .34230  .26503  .20170  .15120  .11185  .08176  .05914
  10     .91141  .83431  .73992  .63712  .53415  .43727  .35048  .27571  .21331  .16261  .12232  .09094
  11     .94735  .89118  .81526  .72544  .62884  .53210  .44049  .35752  .28506  .22367  .17299  .13206
  12     .96992  .93117  .87336  .79907  .71330  .62189  .53039  .44326  .36364  .29333  .23299  .18250
  13     .98344  .95798  .91608  .85761  .78513  .70293  .61596  .52892  .44568  .36904  .30071  .24144
  14     .99119  .97519  .94615  .90215  .84360  .77294  .69393  .61082  .52764  .44781  .37384  .30735
  15     .99547  .98581  .96649  .93471  .88933  .83105  .76218  .68604  .60630  .52652  .44971  .37815

One of the earliest tables of the probability integral appeared in Kramp's Analyse des Réfractions (Strasbourg, 1798), where the calculation of $\int e^{-t^2}dt$ [...] is reproduced in the admirable Tables for Statisticians and Biometricians, edited by Karl Pearson (Camb. Univ. Press, 1914), and the same volume also contains Palin Elderton's $P$ Tables for testing ' goodness of fit ' which first appeared in Biometrika, vol. i., and Duffell's Tables of the Logarithms of the $\Gamma$ Function from Biometrika, vol. vii., besides a large number of other valuable tables.

It should be remarked in connection with the last-named table that the formula $\Gamma(x+1)=x\,\Gamma(x)$ enables us to reduce the calculation of any $\Gamma$ function to one in which $x$ lies between 1 and 2, by repeated applications of the logarithmic relation, thus
$$\log\Gamma(x+1)=\log x+\log\Gamma(x)=\log x+\log(x-1)+\log\Gamma(x-1),$$
and so on.
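The reduction just described can be sketched in a few lines. In this illustration the function name `log10_gamma_by_reduction` is an assumption, and Python's standard `math.lgamma` stands in for the printed table of $\log\Gamma$ on the range 1 to 2; it is not the method by which the tables themselves were computed.

```python
import math


def log10_gamma_by_reduction(x):
    """Common logarithm of Gamma(x) for x >= 1, using Gamma(x) = (x - 1) Gamma(x - 1).

    Factors are peeled off until the argument lies between 1 and 2, where a
    printed table would be consulted; math.lgamma stands in for that table here.
    """
    if x < 1.0:
        raise ValueError("this sketch only handles x >= 1")
    total = 0.0
    while x > 2.0:
        x -= 1.0
        total += math.log10(x)   # log of the factor (x - 1) just removed
    table_value = math.lgamma(x) / math.log(10.0)   # stand-in for the table lookup
    return total + table_value


# Gamma(6) = 5! = 120, so the result should equal log10(120).
print(f"log10 Gamma(6)   = {log10_gamma_by_reduction(6.0):.7f}")
print(f"log10 120        = {math.log10(120.0):.7f}")
print(f"log10 Gamma(7.3) = {log10_gamma_by_reduction(7.3):.7f}")
```

Only the final lookup needs the table; everything else is the addition of ordinary logarithms.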
When x is large, however, say greater than 10, the well-known approximate formula (see, for instance, Whittaker's Analysis, § 110)

$$\Gamma(x+1) \approx \sqrt{2\pi x}\;x^{x}e^{-x}\,e^{\frac{1}{12x}}$$

will be found useful, and it may also be written

$$\log_{10}\frac{\Gamma(x+1)}{x^{x}e^{-x}} = 0.3990899 + \frac{0.03619121}{x} + \tfrac{1}{2}\log_{10}x,$$

a form often convenient.

It may be of service to record here, for speedy reference, the values of a few constants which frequently recur:

   $e = 2.718\,2818$          $\pi = 3.141\,5926$                          $\log_{10}2 = 0.301\,0300$
   $1/e = 0.367\,8794$        $\log_{10}\pi = 0.497\,1499$                 $\log_{10}3 = 0.477\,1213$
   $\log_{10}e = 0.434\,2945$ $\log_{10}(\log_{10}e) = \bar{1}.637\,7843$  $\log_{10}\frac{1}{\sqrt{2\pi}} = \bar{1}.600\,9101$

The statistician who has Pearson's Tables, Barlow's Tables of Squares, etc., together with a good set of Tables of Logarithms (unless he is so fortunate as to have a mechanical calculator, for instance a Brunsviga, at his disposal) and of Trigonometrical Functions such as Chambers's Seven-Figure Tables, may consider himself amply provided for serious research and decidedly better off than his predecessors who prepared the way for him by doing great work with much poorer tools.

MISCELLANEOUS EXAMPLES

[Selected from London B.Sc. (Econ.) Pass and Honours papers]

PART I

(1) Define the genus 'average,' and the principal species of that genus. Adduce concrete cases in which (a) the Arithmetic Mean, or (b) the Median, is specially appropriate.

(2) Supposing that statistics of rents of working-class dwellings have been collected in a certain district for a series of years, describe some way of forming an index number showing the changes in rents from year to year during the period. Give reasons for the process you adopt, or state any advantages it appears to you to possess.

(3) Measure by whatever method you think most suitable the correlation between the two following series, and show graphically the relationship between the two series.

   Year    Exports      Unemployment    |   Year    Exports      Unemployment
           per head.    Index.          |           per head.    Index.
              £                         |              £
   1884      6.5           8.1          |   1899      6.3           2.2
   1885      5.9           9.3          |   1900      7.1           2.5
   1886      5.9          10.2          |   1901      6.7           3.3
   1887      6.1           7.6          |   1902      6.8           4.0
   1888      6.4           4.6          |   1903      6.9           4.7
   1889      6.7           2.1          |   1904      7.1           6.0
   1890      7.0           2.1          |   1905      7.7           5.0
   1891      6.5           3.5          |   1906      8.7           3.6
   1892      6.0           6.3          |   1907      9.7           3.7
   1893      5.7           7.5          |   1908      8.5           7.8
   1894      5.6           6.9          |   1909      8.5           7.7
   1895      5.8           5.8          |   1910      9.6           4.7
   1896      6.1           3.4          |   1911     10.0           3.0
   1897      5.9           3.5          |   1912     10.7           3.2
   1898      5.8           2.9          |   1913     11.4           2.1

(4) Apply some test by which the figures in the previous table can be used to determine whether unemployment (as there measured) increased or diminished in the 30 years.

(5) Exhibit the difficulty of comparing nations, in respect of power and prosperity, by means of statistics relating to (a) the number of population, (b) occupations, (c) criminality, (d) exports and imports.

(6) Draw up, with careful attention to form and detail and showing all sub-totals, a blank table in which could be shown, for the years 1919 to 1923 inclusive, the numbers of students who entered for the Final Examination for B.Sc. (Econ.), distinguishing Internal and External Students, Pass and Honours Candidates, and the results of the examinations (Pass or Fail in the case of Pass Candidates and Honours I, II, III, Pass Degree, Fail in the case of Honours Candidates).

(7) Define the geometric mean, and discuss its use in forming index numbers of prices.

(8) The average prices of wheat and the quantities sold at four markets are given as follows:

   Market.    Average Price per Qr.    Quantity sold, Qrs.
   A              27s. 3d.                  36,000
   B              28s. 8d.                   1,000
   C              29s. 1d.                  16,000
   D              27s. 2d.                  12,000

Find the mean price for the four markets, weighting each local average with the quantity sold. Would it be possible for the average price at each of the above markets to rise from one year to the next and yet for the weighted mean price to fall? If so, under what conditions?

(9) Illustrate the necessity for standardisation when heterogeneous groups are in question by describing the methods of computing standard birth- or death-rates or family food-consumption.
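The arithmetic asked for in Question (8) above can be sketched as follows. This Python fragment is an added illustration, not a model answer from the examination. Prices are converted to pence (12 pence to the shilling). The second-year figures are invented purely to show that every local price can rise while the weighted mean falls, which happens when sales shift towards the cheaper markets.

```python
def pence(shillings, d):
    """Convert a price in shillings and pence to pence."""
    return 12 * shillings + d


def weighted_mean(values, weights):
    """Mean of the values, each weighted by the corresponding quantity."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Data of Question (8): markets A, B, C, D.
prices = [pence(27, 3), pence(28, 8), pence(29, 1), pence(27, 2)]
quantities = [36_000, 1_000, 16_000, 12_000]
mean_price = weighted_mean(prices, quantities)

# Hypothetical following year (assumed figures, not from the source):
# every price is one penny higher, but sales move from dear market C
# to cheap market D.
later_prices = [p + 1 for p in prices]
later_quantities = [36_000, 1_000, 2_000, 26_000]
later_mean = weighted_mean(later_prices, later_quantities)

if __name__ == "__main__":
    for label, value in (("first year", mean_price), ("later year", later_mean)):
        shillings, d = divmod(value, 12)
        print(f"{label}: {int(shillings)}s. {d:.2f}d.")
```

The weights, not the prices, drive the fall in the second year: the quantity sold at the dearest large market drops while that at the cheapest rises.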
(10) The following are the Annual Premiums required to secure at death £1000 plus a Guaranteed Reversionary Bonus of £2 per cent. on the sum assured under the Whole Life Policies of a certain Assurance Company:

   Age next Birthday.    Annual Premium.
                           £   s.   d.
         25               24   12    6
         30               27   14    2
         35               31   11    8
         40               36    7    6
         45               42    6    8

Find by any method of interpolation what the premium would be at age 36 next birthday.

(11) Explain why the method of measuring the mortality from any disease by the proportion of deaths from that disease to deaths from all causes is essentially fallacious. Criticise the following mode of argument in a recent blue-book, containing anthropometric data with respect to school children: The gradation in weight from the poorest group up to the wealthiest is one of the most striking features of the tables. If we take all the children of ages from 5 to 18, we find that the average weight of the boy from a one-roomed tenement is 52.6 lb.; of the boy from a two-roomed tenement, 56.1 lb.; of the boy from a three-roomed tenement, 60.6 lb.; of the boy from a tenement of four rooms or more, 64.3 lb.

(12) Show how to measure the 'trend' and the 'fluctuation' of a series of numbers relating to economic phenomena, such as trade or employment.

(13) Find the average age and the median age of the married men included in the table below, and calculate one measure of dispersion.

   Ages.        Married Men.                              Widowers.
                Number of Men    Average Number of        Number of Men
                (000's).         Children under 10.       (000's).
   Under 20          1                 .47
   20-              34                 .61
   25-              99                 .97                     1
   30-             132                1.?0                     2
   35-             139                1.99                     3
   40-             138                1.98                     5
   45-             130                1.53                     7
   50-             104                 .95                     9
   55-              78                 .48                    11
   60-              53                 .20                    13
   65-              33                 .09                    13.5
   70-              15                 .06                    10.5
   75-               6                 .04                     7
   80-               2                 .04                     4

(14) What do you understand by a weighted average? Estimate the average number of children of married men of all ages from the data in the above table.
(15) Estimate the number of married men between the ages 52 and 53 in the same table, and also estimate at what age the average number of children is a maximum. Illustrate each estimate by a diagram.

(16) Define frequency group and standard deviation. Show that, if $m'_2$ is the second moment of any frequency group about any origin and $m_2$ the second moment about the average of the group, and $\bar{x}$ is the average measured from the origin, then

$$m_2 = m'_2 - \bar{x}^{2}.$$

Calculate the standard deviation of the ages of widowers shown in the table of question (13).

(17) State the product-sum formula for the correlation coefficient, and prove that if the means of rows and of columns of the correlation table lie on two straight lines the equations to these lines are

$$x = r\,\frac{\sigma_x}{\sigma_y}\,y, \qquad y = r\,\frac{\sigma_y}{\sigma_x}\,x,$$

respectively, x and y being deviations of the variables from their arithmetic means, r the correlation coefficient, and $\sigma_x$, $\sigma_y$ the standard deviations.

(18) Below are given the populations of the County of London and the four surrounding administrative counties at the Censuses of 1891 and of 1901:

   County.        Population.
                  1891.          1901.
   London         4,228,317      4,536,541
   Essex            578,471        816,640
   Middlesex        542,894        792,314
   Surrey           419,115        519,654
   Kent             807,328        936,240
   Total          6,576,125      7,601,389

(a) Assuming a constant percentage-rate of increase in each administrative county, estimate its population in 1896 at a date midway between the two Censuses.

(b) Assuming a constant percentage-rate of increase for the area as a whole (London and the four surrounding counties), estimate the total population at the same date in 1896.

Why does your estimate (b) differ from the sum of the estimates under (a)?

(19) Give as exact a definition as possible of the term 'Cost of Living.' How far can the change in the Cost of Living be measured over a period in which there have been considerable modifications of diet or other changes in consumption of necessary commodities?
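The comparison asked for in Question (18) above can be illustrated with a short Python sketch. It is an added example and rests on one stated assumption: a "constant percentage-rate of increase" is read as compound growth, so that the population midway between two censuses is the geometric mean of the two census figures. The variable names are illustrative.

```python
import math

# Census populations (1891, 1901) from the table of Question (18).
census = {
    "London": (4_228_317, 4_536_541),
    "Essex": (578_471, 816_640),
    "Middlesex": (542_894, 792_314),
    "Surrey": (419_115, 519_654),
    "Kent": (807_328, 936_240),
}


def midpoint_estimate(p_start, p_end):
    """Population midway between two censuses under compound growth:
    the geometric mean of the two census figures."""
    return math.sqrt(p_start * p_end)


# (a) Each county estimated separately.
county_estimates = {name: midpoint_estimate(a, b) for name, (a, b) in census.items()}
sum_of_counties = sum(county_estimates.values())

# (b) The area treated as a whole.
total_1891 = sum(a for a, _ in census.values())
total_1901 = sum(b for _, b in census.values())
area_estimate = midpoint_estimate(total_1891, total_1901)

if __name__ == "__main__":
    for name, value in county_estimates.items():
        print(f"{name:10s} {value:12,.0f}")
    print(f"{'Sum (a)':10s} {sum_of_counties:12,.0f}")
    print(f"{'Area (b)':10s} {area_estimate:12,.0f}")
```

Under this reading estimate (b) exceeds the sum under (a), because the counties grew at different rates and the geometric mean of sums is never less than the sum of geometric means.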
(20) Discuss the methods of presenting wage statistics by averages. Illustrate by a diagram the following data: Building Trades. Men. Full time earnings. Median, 37s.; Quartiles, 29s. 6d., 40s. 6d.; 5.9 per cent. received less than 20s.; 2.8 per cent. received 45s. or more. Estimate the average wage roughly.

(21) Construct a diagram to show graphically the relationship between yield of corn and rainfall from the data in the table below.

   Years.   Yield per acre     Rainfall in      |   Years.   Yield per acre     Rainfall in
            of corn,           July,            |            of corn,           July,
            in bushels.        in inches.       |            in bushels.        in inches.
   1886        24.5               1.10          |   1896        40.5               6.17
   1887        19.2               2.40          |   1897        32.5               3.59
   1888        35.7               3.83          |   1898        30.0               2.84
   1889        32.3               4.45          |   1899        36.0               3.42
   1890        26.2               2.03          |   1900        37.0               4.15
   1891        33.5               1.88          |   1901        21.4               2.63
   1892        26.2               3.71          |   1902        38.7               4.78
   1893        25.7               2.20          |   1903        32.2               3.41
   1894        28.8               1.58          |   1904        36.5               5.23
   1895        37.4               6.01          |   1905        39.8               4.78

(22) Find the correlation between yield of corn and rainfall in the above table.

(23) Define (a) arithmetic average, (b) geometric average, (c) median, (d) mode, (e) quartile. Instance cases when (b), (c) and (d) are specially appropriate.

(24) Comment on the form of grouping adopted in (26) Table I., and state any inconveniences that it presents. Calculate approximately the values of the median and quartiles, using a graphic method.

(25) Explain what is meant by the skewness of a frequency distribution. Give Pearson's measure of the skewness, and any other way of measuring it. Obtain some measure of the skewness of the distribution in (26) Table II.

(26) Table I., showing the number of civil parishes in England and Wales in which the population at the Census of 1901 lay between the limits given in the column on the left:

   Population.             Number of          |   Population.               Number of
                           Civil Parishes.    |                             Civil Parishes.
   None                         25            |   500 and under 750             1,557
   1 and under 50              812            |   750    "    1,000               842
   50    "    100            1,339            |   1,000  "    5,000             2,411
   100   "    200            2,503            |   5,000  "   10,000               413
   200   "    300            2,036            |   10,000 "   20,000               241
   300   "    400            1,410            |   20,000 and upwards              273
   400   "    500            1,038            |
                                                  Total No. of Civil Parishes  14,900

Table II., showing the number of rooms measured, in a certain investigation, in which the size lay between the limits given in the column on the left; area calculated to the nearest square foot:

[Table II is missing from this copy.]

Write a short account of the use of graphic methods in statistics. Draw diagrams representing the data of Table I. and of Table II.

(27) Define the standard deviation, and show that the mean square deviation is least when deviations are measured from the arithmetic mean. Find the mean and standard deviation for the sizes of the rooms given in (26) Table II.

(28) What corrections are applied to the crude death-rates of areas in order to obtain comparable rates?

(29) (1) Estimated average weekly wages of agricultural labourers in thirty-six counties of England in 1891, and (2) the percentage of the population in receipt of poor law relief, in rural unions of the same counties, on 1st January of the same year:

   County.  Wages.   Percentage   |  County.  Wages.   Percentage   |  County.  Wages.   Percentage
            s.  d.   in Receipt   |           s.  d.   in Receipt   |           s.  d.   in Receipt
                     of Relief.   |                    of Relief.   |                    of Relief.
      1     18  6       1.7       |    13     14  8       3.6       |    25     12  0       4.9
      2     18  0       2.3       |    14     14  0       3.1       |    26     12  0       4.7
      3     17  0       2.5       |    15     14  0       4.0       |    27     12  0       3.9
      4     17  0       2.1       |    16     14  0       2.3       |    28     11  6       4.0
      5     16  0       3.0       |    17     13  0       2.8       |    29     11  6       4.5
      6     16  0       2.1       |    18     12  0       4.5       |    30     11  6       4.2
      7     15  6       2.8       |    19     12  0       4.7       |    31     11  0       5.2
      8     15  6       2.7       |    20     12  0       4.7       |    32     11  0       4.2
      9     15  0       3.5       |    21     12  0       5.7       |    33     10  6       4.2
     10     15  0       3.1       |    22     12  0       4.1       |    34     10  0       3.2
     11     15  0       3.1       |    23     12  0       4.9       |    35     10  0       4.4
     12     15  0       ?.7       |    24     12  0       4.2       |    36     10  0       4.8

Define the arithmetic mean, the median and the mode, and give a sketch of a skew frequency distribution showing the approximate position of each.
State the chief advantages of the arithmetic mean as a form of average, and find the arithmetic mean and the median for the wages of agricultural labourers in the above table.

(30) Explain clearly the meaning of the term 'dispersion,' and find the mean deviation from the median for wages in the same table.

(31) Also, define standard deviation, and find the standard deviation of these wages.

(32) Using the data in question (29), test graphically, with squared paper, the correlation between average wages and percentages of the population in receipt of poor law relief, stating your conclusions in words.

(33) Construct a blank table, complete with headings and lines, and with due regard to spacing, in which could be inserted the numbers of persons employed in six groups of industries, four grades of age at three different periods.

(34) The following table gives for 780 weeks the call discount rate and the ratio of reserves to deposits in New York. Calculate the average discount rate for the various ratios of reserves to deposits, and express the results graphically.

[The cell frequencies of the correlation table are illegible in this copy; the marginal totals are as follows.]

   Call Discount Rates.    1-    2-    3-    4-    5-    6-    7-    8-    9-   10-   12-   15-   20-   25-   Totals.
   Totals                 214   195   109    97    72    45    18    10     4     7     1     3     1     4     780

   Ratio of Reserves
   to Deposits.           21%-  23%-  25%-  27%-  29%-  31%-  33%-  35%-  37%-  39%-  41%-  43%-  45%-  Totals.
   Totals                   3    10   127   239   162    89    46    24    20    36    20     2     2     780

The heading 15- covers all rates of 15 and over but less than 16, etc.

From the following data find the equation of the regression line giving the average discount rate for all ratios of reserves to deposits, and plot the line on the same diagram. Is the use of the product moment method of determining correlation justifiable in this case?

                                      Means.    Standard Deviations.    Correlation.
   Call Discount Rates                  3.6            2.5
   Ratios of Reserves to Deposits      30.3            4.2               [illegible]

(35) As an illustration of the nature of definitions in statistics explain fully the meaning of the statement: 'The total value of exports (produce and manufactures of the United Kingdom) in 1918 was £498,473,065.'

PART II

(1) Give a short account of the chief official publications, in England, relating to statistics of one of the following subjects, with especial reference to the source and the precise meaning of the data:
(a) Vital statistics (births, deaths and marriages).
(b) Foreign trade.
(c) Agriculture.

(2) What do you understand by the words 'frequency group'?

   England and Wales, 1911

   Ages                            10-    15-    20-    25-    35-    45-    55-    65-
   All Occupied (Males, 000s.)     246   1164   1146   2225   1815   1262    723    299
   Coal Hewers (00s.)               63    338    538   1067    798    467    212     50

Compute suitable averages and measures of dispersion for the comparison of the age groups in this table and comment on the results.

(3) What means are available for testing the significance of differences between statistical coefficients? Test whether the differences between the means and measures of dispersion for the two series given in the previous question are significant.

(4) $s_1 = 4.53$ and $s_2 = 3.71$ are the standard deviations of two groups, $x_1, x_2 \ldots x_n$, and $y_1, y_2 \ldots y_n$. $Sxy =$ [illegible]. $n = 1000$. Explain exactly the meaning of standard deviations. Calculate the product-sum coefficient of correlation between the groups, and state what it measures. Write down the probable error of the coefficient and explain its meaning.

(5) Find the standard deviation of the differences between corresponding values of two variates x and y.

(6) Set out in detail the method by which you would make graphic comparisons of two such series of figures as Imports of Manufactures and Unemployment.
(7) If the recorded births in a certain district may be in defect by x per cent., and the estimated population in error by ±y per cent., find an approximate expression for the greatest possible error in the birth-rate, x and y being assumed fairly small (say, not more than 5 per cent. or so).

(8) Given five thousand different figures (e.g. quotations of prices, or measurements of human statures), how would you (a) select five hundred figures at random from that total, and (b) ascertain the probability that the average of the five hundred selected figures does not differ from the average of the five thousand by more than any assigned extent?

(9) (a) Number of Persons per Tenement and Number of Rooms per Tenement:

[The cell frequencies of the correlation table are illegible in this copy; the marginal totals and averages are as follows.]

   Number of Persons per Tenement       1      2      3      4      5      6      7      8    Total.
   Total                               56    191    208    179    148    112     62     44     1000
   Approx. Average Number of Rooms    3.5    4.8    5.0    5.2    5.3    5.4    5.5    5.7      5.1

   Number of Rooms per Tenement         2      3      4      5      6      7      8      9    10 or more   Total.
   Totals                              54    105    275    256    152     55     35     21       47         1000
   Average Number of Persons         2.07   3.56   3.94   4.24   4.24   3.78   4.14   4.62     4.81        3.976

Standard deviations: persons 1.83, rooms 1.9 (approx.).

(b) Show that the coefficient of correlation can be expressed in the form

$$\frac{1}{\sigma_x\sigma_y}\left[\frac{\Sigma(xy)}{n} - \bar{x}\bar{y}\right],$$

where $\bar{x}$, $\bar{y}$ are the averages of the observations referred to any origin.

(c) Calculate, by any method, a measurement of the correlation between the number of rooms and number of persons per tenement shown in (a).

(d) Calculate the third and fourth moments of the frequency curve of persons; determine the position of the mode and also determine the skewness by any method known to you.

(10) The table given below gives the results of the measurement of series of 959 Oxford Students and of 2348 convicts. Find what, if any, differences between the statistical constants given are significant and comment on the results.

   Character.      Data.       Means.     Standard       Coefficients of
                                          Deviations.    Correlation with Stature.
   Head length     Students    196.05       6.23            .31
                   Convicts    192.44       6.39            .26
   Head breadth    Students    152.84       4.92            .14
                   Convicts    151.02       5.49            .15
   Head height     Students    136.62       5.80            .28
                   Convicts    132.29       5.21            .19
   Stature         Students     69.49       2.60
                   Convicts     65.44       2.65

(11) Give a short account of the nature of the information contained in one of the following: Census of Production, 1907; Reports on Wages in 1906 (the 'Wage Census'); Reports on Buildings and Tenements (housing and overcrowding) in the Population Census, 1911.

(12) Outline a method by which the normal curve of error can be obtained as the limit of $(p+q)^{n}$.

(13) Discuss (a) the best means of obtaining accurate statistics of family expenditure, and (b) the best means of combining such data so as to form a representative type.

(14) Define the following terms and give illustrations of their use: interpolation, standard deviation, moment, skewness, logarithmic scale, geometric mean, partial correlation, normal curve of error.

(15) In m trials an event has happened r times. How would you determine the probability that this result is consistent with the hypothesis of random sampling from a universe in which the chance of the event happening is a certain small quantity p? Why cannot the required probability be derived from a table of the normal curve of error?

(16) If $m_1$, $m_2$ are the numbers of deaths occurring in a year among $N_1$, $N_2$ persons of two different occupations, the standard deviation by which the significance of the observed difference in the death rates per 1000 can be tested is given as [formula illegible]. Show how this formula is obtained and criticise it.

(17) Contrast the methods used in the construction of any two current index numbers of wholesale prices.
Under what conditions is 'weighting' important in index numbers?

(18) Analyse in some detail the cases in which it may be assumed (1) that a frequency distribution is normal, or (2) that the probability of errors in a measurement or observation exceeding various amounts is determinable by the normal table of probability.

(19) What methods are available for testing the 'goodness of fit' of a mathematical curve to observations?

(20) If $z = x_1 + x_2 + \ldots + x_n$, where the x's are deviations from the average of quantities selected at random and independently of each other from a curve of frequency whose standard deviation is σ, show that the standard deviation of z is $\sigma\sqrt{n}$, n being finite. Under what circumstances can it be shown that the curve of frequency of z is normal?

(21) What methods are available for classifying frequency curves into types? State briefly the mathematical concepts underlying the Pearsonian classification of frequency curves.

(22) Explain how the necessity for Sheppard's corrections of moments arises. If $m_2$ is the second moment calculated from observations when all are supposed to be grouped at the middle of grades whose breadth is h, show that $m_2 + \frac{h^{2}}{12}$ is the second moment if it is assumed that the observations are evenly distributed through the grades.

(23) A sample containing 1000 is drawn at random from a large universe and 300 are found to possess a certain attribute. Can you infer anything as to the proportion in the universe that have this attribute, or what further information is needed?

(24) Discuss the effect on a weighted mean of errors in the quantities or the weights.

(25) Write a brief note on the assumptions made in calculating the probable error of a statistical quantity, such as the standard deviation or the correlation coefficient.
(26) Calculate the average, second, third and fourth moments, mean deviation, standard deviation and skewness of the frequency groups of chest girths shown in the following table:

   Height and Chest Girth of 1126 Recruits of 18 Years of Age.

[The cell frequencies of the correlation table are illegible in this copy; the marginal totals, averages and standard deviations are as follows.]

   Chest Girth in Inches      28    29    30    31    32    33    34    35    36    37    38    39   Totals.
   Totals                      1     2     9    47   159   207   280   219   127    49    22     4    1126

   Height in Inches                   60     61     62     63     64     65     66     67     68     69     70     71     72   Totals.
   Totals                             23     68     91    140    180    143    144    121     97     71     30     14      4    1126
   Averages of Arrays               32.7   33.6   33.5   34.2   34.1   34.7   34.7   35.0   35.1   35.5   36.3   35.4   34.5   34.51
   Standard Deviations of Arrays    1.75   1.87   1.57   1.34   1.33   1.50   1.76   1.40   1.77   1.67    1.3    1.8    2.2    1.66

Average height, 65.6; standard deviation, 2.52.

Compare the relations between the quantities calculated with those that are found in the normal curve of error.

(27) From the above data draw the regression line (chest girth on height), and with the help of your drawing find an approximate value of the coefficient of correlation between height and chest girth. In normal distributions the standard deviation of an array is $\sigma_1\sqrt{1-r^{2}}$, where $\sigma_1$ is the standard deviation of the arrays merged in one group. Are the standard deviations shown in the table consistent with this formula?

INDEX

[The numbers refer to pages; I and II refer to the two parts of the book.]

Abscissa, I 68.
Abstract of Labour Statistics, I 19; II 177, 213, 283.
Advancing Differences, I 92.
Age Distribution of Criminals, I 64.
Anemone Nemorosa, II 227-30.
Arithmetic Mean (see Mean).
Array, I 109, 122-3, 127; II 251, 256-7, 262.
Assortative Mating, II 168.
Average, I 4, 9-11, 22-41, 63, 115, 125; II 136, 155-7, 173.
Bernoulli, II 149, 248.
Beveridge, I 15, 81.
Biassed and Unbiassed Errors, II 154-6, 170, 179.
Binomial, I 91; II 141-3, 146-7, 151, 181, 187-8, 190, 232-5, 240.
Biometrika, I 51, 103, 109, 114; II 134, 161-3, 165-7, 196, 212, 227, 230.
Birth Rate, I 7, 32, 82; II 234.
Board of Trade Index Number of Prices, I 39.
Board of Trade Journal and Commercial Gazette, II 282.
Booth, II 279.
Bowley, I 37; II 170-3, 280.
Boyle's Law, I 75.
Brain-weight, I 103-4; II 166.
Bravais, I 114.
Burnett-Hurst, II 173.
Calculation of Mean and Standard Deviation, I 52-5.
Cattle and Grass-land, I 120-5.
Census, I 1, 14-15, 20, 40, 85, 117; II 177, 280-2.
Central Statistical Office, II 279.
Centre of Gravity, I 60; II 269.
Chance, I 4; II 134, 136.
Charlier, II 248.
Classification and Tabulation, I 4, 14-21.
Class-interval, I 21, 52, 103, 120-3; II 195, 213, 218.
Coefficient of Correlation, I 108, 110-12, 120, 129, 131; II 133, 152, 158, 162-3, 168-9, 253, 260, 278.
Coefficient of Variation, I 50-51; II 133-4, 162-4, 167-8.
Compound Interest, I 7; II 263.
Condition of School Children, I 17.
Constancy of Great Numbers, I 1.
Consumers' Surplus, I 97, 101.
Continuous Variation, I 26; II 182, 195-6, 232.
Co-ordinates, I 68.
Corrected Death Rate, I 32-5.
Correlation, I 4, 18, 76, 78, 82, 102-31; II 151-3, 166-7, 253.
Correlation Ratio, I 112, 114; II 163.
Cost of Living, I 8, 29, 35-8, 125-31; II 281-2.
Criminal Anthropometry, II 166.
Crude Death Rate, I 32-5.
Cuckoo's Eggs, II 161.
Cunynghame, I 95.
Curve Fitting, II 179-230.
Darwin, I 3.
Death Rate, I 7, 32-5, 82; II 234.
Deciles, I 49.
Decreasing Return, I 99-100.
De Moivre, II 149.
Descartes, I 68.
Dispersion, I 4, 42-51, 108; II 145, 162.
Distribution of Random Digits, II 157.
Dividend Sample, II 171.
Economist, I 9-11.
Edgeworth, I 3, 114; II 172, 230, 248.
Elasticity of Wire, I 73.
Elderton, II 190, 203, 230.
Ellipse, I 69; II 252, 257-62.
Error, I 105.
Errors of Observation, I 4, 73; II 233.
Euler, II 149, 248.
Examination Marks, I 25, 27-8, 52-5, 66-7, 93; II 175-6, 207-212.
Fechner, II 248.
Fermat, II 149.
Fitting of Curves, II 179-230.
Fitting with a Parabola, I 88-9.
Fluctuations of Sampling, I 47.
Flux, I 39.
Frequency, I 12, 54, 60, 62, 102-3, 122-3, 127; II 135, 143, 150-3, 157, 172, 178-9, 184, 195-6, 200, 238-9.
Frequency Curve, I 60; II 178-230.
Frequency Distribution, I 11-13, 52-67; II 194, 267.
Frequency Polygon, II 180, 182, 184, 195.
Frequency Surface, II 249-62.
Function, I 72, 87.
Galton, I 3, 49, 108-9, 114; II 248.
Gamma Function, II 214.
Gauss, II 248, 273.
Generalised Probability Curve, II 187.
Geometric Mean, I 38-9; II 264-5.
Goodness of Fit, II 180, 212, 219, 223-224.
Graphs, I 57, 68-101.
Hain, I 3.
Halley, I 1.
Harmonic Mean, I 39; II 264-5.
Height of School Children, I 18.
Herschel, II 234.
Histogram, I 60; II 180, 219-20, 223, 225, 227, 232.
Homogeneity, II 165-6.
Household Budgets, I 35-6; II 155, 266.
Hyperbola, I 69.
Hypergeometrical Expansion, II 187.
Increasing Return, I 100.
Index Numbers, I 8-11, 29-31, 35-9, 76-80; II 156.
Indictable Offences and Unemployment, I 82-3.
Infant Mortality, I 65, 117-19.
Infectious Disease Rate, I 56, 62; II 176, 219-27.
Inheritance, II 162, 167-8.
International Statistical Congress, I 2.
Interpolation, I 24, 48-9, 76, 85-94; II 212.
Journal of the Royal Statistical Society, I 8, 37; II 170, 230.
Kapteyn, II 248.
Knapp, I 3.
Labour Gazette, I 120; II 282.
Lagrange, II 149.
Lagrange's Interpolation Formula, I 94.
Laplace, II 149, 248.
Latter, II 161.
Least Squares, I 105-6; II 196, 271-3.
Lee, I 109; II 167.
Lexis, I 3.
L'homme Moyen, I 2.
Limits for Correlation Coefficient, I 110.
Lipps, II 248.
Livelihood and Poverty, II 173.
Logarithmic Curve, I 87.
London Statistics, I 83, 117.
McAlister, II 248.
Macdonell, I 51; II 166.
Maclaurin, I 87; II 187.
Marriage Rate, I 32, 77-81, 115-17; II 234.
Marshall, I 95.
Mean, Arithmetic Mean, I 22, 30, 38-41, 52-5, 60, 62-5, 104-13, 116-19, 122-5, 127-9; II 132-4, 144-5, 147-8, 153, 157-8, 160-4, 167-9, 193, 201, 204-5, 208, 217, 223, 226, 228, 238, 251, 256-8, 265-9, 273.
Mean Deviation, I 42-6, 50, 55, 64; II 246, 270-1.
Mean Error, II 245.
Mean-square Deviation, I 46-7, 54; II 211, 268-9.
Median, I 23-6, 36, 38-41, 43-6, 60-7, 87; II 162, 204, 238, 270.
Median Error, II 245.
Meteorological Observations, I 83-4.
Minority Report on the Poor Law, I 15.
Mode, I 26-9, 36, 38-41, 60-2, 64, 66-7; II 162, 186, 190-3, 204-5, 217, 223, 226, 228-9, 238.
Moments, I 123-4, 127; II 152-3, 163, 187, 194-205, 207-8, 213-16, 220-22, 225-6, 228, 267-9.
Monopoly, I 101.
Monthly Bulletin of Statistics, II 283.
Non-linear Regression, I 112, 114.
Non-measurable Characters, I 6; II 162.
Normal Curve of Error, I 4; II 133, 184, 193, 206-12, 231-48, 251-2, 256.
Normal Demand, I 99.
Occupational Death Rate, I 34.
Ogive Curve, I 66.
Ordinate, I 68.
Overcrowding, I 20, 117-19.
Paish, I 9.
Parabola, I 69, 73, 88-9; II 188, 197-9.
Pascal, II 149.
Pearl, I 103; II 166.
Pearl and Dunbar, I 51.
Pearl and Fuller, II 134.
Pearson, I 3, 50-1, 60-1, 108-9, 112, 114; II 162, 165, 167, 184, 187, 191, 212, 230, 248, 255, 276.
Petty, I 1.
Philosophical Transactions, II 162.
Plotting of a Frequency Distribution, I 55-60.
Point of Inflexion, II 238, 245.
Poisson, II 248.
Poor Relief, I 88-92, 120.
Population according to Age, I 20.
Prediction, II 169.
Prices, I 8-11, 35-9, 76-81, 115-17, 125-31; II 235, 265, 281-2.
Probability, I 3-4; II 132-49, 151, 181, 187, 231, 236-50.
Probability Curve, I 4; II 184, 187, 236.
Probability Integral, II 209, 239, 245, 275.
Probable Error, II 133-4, 145, 150-64, 245-6.
Probable Error of Mean, II 153-4.
Probable Error of Sum or Difference, II 157-8.
Proceedings of London Mathematical Society, II 203.
Producers' Surplus, I 98, 100-1.
Product Deviation, I 106, 113, 116, 118-9, 123-5, 127-9.
Production Census, II 177, 282.
Progress of the Nation, I 88, 94.
Quadrature Formulae, II 197, 209, 218, 230.
Quartile, I 48-9, 64, 66-7, 87; II 245.
Quartile Deviation, I 48-50, 55, 64-5; II 245-6.
Questionnaire, I 15.
Quetelet, I 1-3.
Random Errors, I 4.
Random Sampling (see Sampling).
Registrar-General, I 3, 14, 33, 55; II 280, 283.
Regression, Coefficient of, Line of, I 108-12, 117, 119, 125-6, 129-31; II 162, 253, 257, 260-1.
Rent, I 29-31, 101, 125-31; II 281.
Rowntree, II 173, 279.
Royal Society Transactions, II 230.
Sampling, Random Sampling, I 4, 47; II 132-4, 145-77, 179, 182, 212, 236, 243.
Sauerbeck, I 8-10, 35.
Schuster, II 163.
Secular Trend, I 80.
Sheppard's Adjustment, I 104; II 203, 208, 213, 225.
Short-time Fluctuations, I 78, 80.
Significance, II 133, 159-68.
Simpson's Rule, II 219, 230.
Skew, Skewness, I 4, 49, 61-4; II 146, 180, 184, 186-7, 190-2, 205, 235, 247.
Sleep and Physical Condition, I 19.
Smooth Curve, I 60, 86; II 234.
Social Statistics, II 279.
Standard Deviation, I 46-7, 50-5, 63, 106-13, 116-19, 122-5, 127-9; II 133-4, 145-8, 151-4, 157-64, 166-8, 171-2, 176-7, 207-9, 240-6, 251, 256-8, 266-9.
Standard Error, II 134.
Standard Population, I 34.
Statical Analogues, II 269, 271.
Statist, I 9.
Stirling, II 248.
Successful Events, II 136-8, 140-8, 151-4, 169, 171, 174-5, 231, 240.
Successive Differences, I 91.
Supply and Demand Curves, I 95-101.
Süssmilch, I 1.
Symmetrical Distribution, I 61; II 180.
Table of Probable Errors, II 163.
Tawney, I 40; II 158.
Tax, I 100-101.
Temperature Records, I 65.
Todhunter, II 149, 248.
Unemployment, I 15, 81-3; II 213-19.
Variability, I 42-51, 61, 63, 107-108, 111; II 132-4, 162, 165-6, 247-8.
Variable, I 6-13, 72, 75, 87, 116, 122; II 263-9.
Variation, I 6.
Variation in Arcella, I 51.
Variation in the Earthworm, II 134.
Variation in Eupagurus Prideauxi, II 163.
Variation in Plants, II 167.
Wages, I 8, 40-2, 125-31; II 173, 224, 227, 234, 281-3.
Weighting, I 30-8.
Weighted Mean, I 29-38, 263 4. Yule, I 114; II 227. Uniform with this Volume BELL'S MATHEMATICAL SERIES (Advanced Section) General Editor: WILLIAM P. MILNE, M.A., D.Sc. Professor of Mathematics, Leeds University AN ELEMENTARY TREATISE ON DIFFERENTIAL EQUATIONS AND THEIR APPLICATIONS By H. T. H. PlAGGiO, M.A., D.Sc, Professor of Mathematics, University College, Nottingham ; formerly Senior Scholar af St. John's College, Cambridge. Third Edition. Demy 8vo. I2.r. net. The earlier chapters contain a simple account of ordinary and partial differential equations, while the later chapters are of a more advanced cllaracter, and cover the course for the Cambridge Mathematical Tripos, Part II., Schedule .'\, and the London 15. Sc. Honours. The number of examples, both workctl and unworked, is very large, and all the answers are given. 'With a skill as admirable as it is rare, the author has appreciated in every part of the work the attainments and needs of the students for whom he writes, and the result is one of the best mathematical text-books in the language' — MatJieinatical Gaze/te. A FIRST COURSE IN NOMOGRAPHY By S. Brodki.skv, M.A., B.Sc, Ph.D., Professor of Applied Mathe- matics at Leeds University. Demy 8vo. los. net. Graphical methods of calculation are becoming of ever greater importance in theoretical and industrial science, as well as in all branches of engineering piMctice. Nomography is one of the most powerful of such methods, and the object of this book is to explain what nomograms are, and how they can be constructed and used. The book caters for both the practical man who wishes to learn the art of making and using nomograms, and the student who desires to understand the underlying principles. It is illustrated by sixty-four figures, most of which are actual nomograms, their construction being analysed in the text. In addition, there are numerous exercises illustrative of the principles and methods. 'A good introductory treatise . . . 
calculated to appeal to the student who desires to make early practical use of the knowledge he acquires.' The Mechanical World.

ELEMENTARY VECTOR ANALYSIS. With application to Geometry and Physics. By C. E. Weatherburn, M.A., D.Sc., Professor of Mathematics, Canterbury University College, Christchurch, N.Z. Demy 8vo. 12s. net.
This book provides a simple exposition of Elementary Vector Analysis, and shows how it may be employed with advantage in Geometry and Mathematics. The use of Vector Analysis in the former is abundantly illustrated by the treatment of the straight line, the plane, the sphere and the twisted curve, which are dealt with as fully as in most elementary books. In Mechanics the author has explained and proved all the important elementary principles.

ADVANCED VECTOR ANALYSIS. With Application to Mathematical Physics. By C. E. Weatherburn, D.Sc., Professor of Mathematics, Canterbury University College, Christchurch, N.Z. Demy 8vo. 15s. net.
'The author has already published in this series an Elementary Vector Analysis. In this companion volume he deals with the higher part of the subject and the applications of the theory, including a chapter on the Equations of Maxwell and Lorentz and the Lorentz-Einstein transformation.' Times.

THE ELEMENTS OF NON-EUCLIDEAN GEOMETRY. By D. M. Y. Sommerville, M.A., D.Sc., Professor of Mathematics, Victoria University College, Wellington, N.Z. 7s. 6d. net.
'An excellent text-book for all students of Geometry.' Nature.
'A useful and stimulating book.' Mathematical Gazette.

ANALYTICAL CONICS. By D. M. Y. Sommerville, M.A., D.Sc. Demy 8vo. 15s. net.
In this book the elementary properties of the Conics are first of all dealt with, and thereafter the higher portions of the subject, such as Conics referred to any axes, Homogeneous Co-ordinates, Invariant and Covariant properties of Conics, etc.
There are abundant collections of examples in all the subjects treated.
A Professor of Mathematics writes: 'I find it a work entirely praiseworthy; and indeed of such excellency as one would expect from the pen of the author of The Elements of Non-Euclidean Geometry. The variety of topics treated is more extensive than that to be found in most existing text-books on the subject. In certain instances the treatment is refreshingly novel, and in all cases the presentation is concise and lucid.'

A TEXT-BOOK OF GEOMETRICAL OPTICS. By A. S. Ramsey, M.A., President of Magdalene College, Cambridge. New and Revised Edition. Demy 8vo. 8s. 6d.
This is a revised edition of the work published in 1914. It contains chapters on Reflection and Refraction, Thin and Thick Lenses, and Combinations of Lenses, Dispersion and Achromatism, Illumination, The Eye and Vision, Optical Instruments, and a chapter of Miscellaneous Theorems, together with upwards of three hundred examples taken from University and College Examination Papers. The range covered is somewhat wider than that required for Part I. of the Mathematical Tripos.

LONDON: G. BELL AND SONS, LTD. York House, Portugal St., W.C. 2.